Working in a Linux/shell environment, how can I accomplish the following:
text file 1 contains:
1
2
3
4
5
text file 2 contains:
1
2
3
4
5
6
7
I would like to print the values that appear in file 2 but not in file 1 (here: 6 and 7).

One quick option, provided neither file repeats a value internally, is to merge the files and keep only the lines that occur exactly once:
cat file1 file2 | sort | uniq -u > unique
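On the sample files above this leaves exactly the two values missing from file 1:
$ cat file1 file2 | sort | uniq -u
6
7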
How about:
diff file_1 file_2 | grep '^>' | cut -c 3-
This prints the entries in file_2 which are not in file_1. For the opposite result, just replace '>' with '<'. 'cut' removes the first two characters that 'diff' adds, which are not part of the original content.
The files don't even need to be sorted.
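With the sample values above in file_1 and file_2, the raw and filtered output look like this:
$ diff file_1 file_2
5a6,7
> 6
> 7
$ diff file_1 file_2 | grep '^>' | cut -c 3-
6
7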
If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates; it may be a good starting point.
However, I'd encourage you to use Perl or Python for this. Basically, the flow of the program would be (sketched here as runnable Python):
def find_unique_values(file1, file2):
    # read each file into a list of values, one per line
    with open(file1) as f:
        contents1 = [line.rstrip('\n') for line in f]
    with open(file2) as f:
        contents2 = [line.rstrip('\n') for line in f]
    # print every value of file2 that never appears in file1
    for value2 in contents2:
        found = False
        for value1 in contents1:
            if value2 == value1:
                found = True
        if not found:
            print(value2)
This isn't the most elegant way of doing it, since it has O(n^2) time complexity, but it will do the job. (Replacing the inner loop with a hash or set lookup would bring it down to O(n), which is essentially what the awk answer below does.)
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7
Explanation of how the code works:

FNR is the current file's record number
NR is the current overall record number from all input files
FNR==NR is true only when we are reading file1
$0 is the current line of text
a[$0] is a hash with the key set to the current line of text
a[$0]++ tracks that we've seen the current line of text
!($0 in a) is true only when we have not seen the line of text
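To watch FNR and NR diverge once awk moves on to the second file, you can print both counters while reading the two sample files:
$ awk '{print FILENAME, NR, FNR}' file1 file2
file1 1 1
file1 2 2
...
file1 5 5
file2 6 1
file2 7 2
...
file2 12 7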
I was wondering which of the following solutions was the "fastest" for "larger" files:
awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2 # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2
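On the small sample files from the question, all five commands print the same result; for instance:
$ comm -13 <(sort file1) <(sort file2)
6
7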
Results of my benchmarks in short:

Don't use grep -Fxf, it's much slower (2-4 times in my tests).
comm is slightly faster than join.
comm and join are much faster than awk1 + awk2. (Of course, awk1 + awk2 have the advantage of not assuming sorted input.)
Wall-clock time is lowest for comm, probably due to the fact that it uses more threads. CPU times are lower for awk1 + awk2.

For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was
# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
321599 321599 8098710 file1
321603 321603 8098794 file2
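Each figure below comes from the shell's time builtin (the user+sys column is simply the two values added up); a single measurement looks roughly like
$ time comm -13 <(sort file1) <(sort file2) > /dev/null   # discard output so printing doesn't distort the timing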
Typical results of fastest runs
awk2: real 0m1.145s user 0m1.088s sys 0m0.056s user+sys 1.144
awk1: real 0m1.369s user 0m1.324s sys 0m0.044s user+sys 1.368
comm: real 0m0.980s user 0m1.608s sys 0m0.184s user+sys 1.792
join: real 0m1.080s user 0m1.756s sys 0m0.140s user+sys 1.896
grep: real 0m4.005s user 0m3.844s sys 0m0.160s user+sys 4.004
BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and !($0 in a) is faster than !a[$0]. So, for an awk solution I suggest:
awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
Here's another awk solution:
$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
6
7