Working in a Linux/shell environment, how can I accomplish the following:
text file 1 contains:
1
2
3
4
5
text file 2 contains:
1
2
3
4
5
6
7
I would like to print the values that appear in file 2 but not in file 1 (here: 6 and 7).

One quick option, provided neither file repeats a value internally, is to merge the files and keep only the lines that occur exactly once:
cat file1 file2 | sort | uniq -u > unique
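On the sample files above this leaves exactly the two values missing from file 1:
$ cat file1 file2 | sort | uniq -u
6
7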
How about:
diff file_1 file_2 | grep '^>' | cut -c 3-
This prints the entries in file_2 which are not in file_1. For the opposite result, just replace '>' with '<'. 'cut' removes the first two characters that 'diff' adds, which are not part of the original content.
The files don't even need to be sorted.
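With the sample values above in file_1 and file_2, the raw and filtered output look like this:
$ diff file_1 file_2
5a6,7
> 6
> 7
$ diff file_1 file_2 | grep '^>' | cut -c 3-
6
7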
If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates; it may be a good starting point.
However, I'd encourage you to use Perl or Python for this. Basically, the flow of the program would be (sketched here as runnable Python):
def find_unique_values(file1, file2):
    # read each file into a list of values, one per line
    with open(file1) as f:
        contents1 = [line.rstrip('\n') for line in f]
    with open(file2) as f:
        contents2 = [line.rstrip('\n') for line in f]
    # print every value of file2 that never appears in file1
    for value2 in contents2:
        found = False
        for value1 in contents1:
            if value2 == value1:
                found = True
        if not found:
            print(value2)
This isn't the most elegant way of doing it, since it has O(n^2) time complexity, but it will do the job. (Replacing the inner loop with a hash or set lookup would bring it down to O(n), which is essentially what the awk answer below does.)
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7
Explanation of how the code works:

FNR is the current file's record number
NR is the current overall record number from all input files
FNR==NR is true only when we are reading file1
$0 is the current line of text
a[$0] is a hash with the key set to the current line of text
a[$0]++ tracks that we've seen the current line of text
!($0 in a) is true only when we have not seen the line of text
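To watch FNR and NR diverge once awk moves on to the second file, you can print both counters while reading the two sample files:
$ awk '{print FILENAME, NR, FNR}' file1 file2
file1 1 1
file1 2 2
...
file1 5 5
file2 6 1
file2 7 2
...
file2 12 7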
I was wondering which of the following solutions was the "fastest" for "larger" files:
awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2 # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2
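On the small sample files from the question, all five commands print the same result; for instance:
$ comm -13 <(sort file1) <(sort file2)
6
7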
Results of my benchmarks in short:

Don't use grep -Fxf, it's much slower (2-4 times in my tests).
comm is slightly faster than join.
comm and join are much faster than awk1 + awk2. (Of course, awk1 + awk2 have the advantage of not assuming sorted input.)
Wall-clock time is lowest for comm, probably due to the fact that it uses more threads. CPU times are lower for awk1 + awk2.

For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was
# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
321599 321599 8098710 file1
321603 321603 8098794 file2
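Each figure below comes from the shell's time builtin (the user+sys column is simply the two values added up); a single measurement looks roughly like
$ time comm -13 <(sort file1) <(sort file2) > /dev/null   # discard output so printing doesn't distort the timing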
Typical results of fastest runs
awk2: real 0m1.145s user 0m1.088s sys 0m0.056s user+sys 1.144
awk1: real 0m1.369s user 0m1.324s sys 0m0.044s user+sys 1.368
comm: real 0m0.980s user 0m1.608s sys 0m0.184s user+sys 1.792
join: real 0m1.080s user 0m1.756s sys 0m0.140s user+sys 1.896
grep: real 0m4.005s user 0m3.844s sys 0m0.160s user+sys 4.004
BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and !($0 in a) is faster than !a[$0]. So, for an awk solution I suggest:
awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
Here's another awk solution:
$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
6
7