Extracting unique values between 2 sets/files

粉色の甜心 2020-11-29 19:44

Working in a Linux shell environment, how can I accomplish the following:

text file 1 contains:

1
2
3
4
5

text file 2 contains:



        
8 Answers
  • 2020-11-29 20:22
    cat file1 file2 | sort -u > unique
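
Note that `sort -u` on the concatenation yields the union of both files, with one copy of each value. If the goal is instead the values that occur in only one of the two files, a minimal sketch is `uniq -u` on the sorted concatenation. It assumes that neither file repeats a line internally, and the function name `only_in_one` is made up for illustration.

```shell
# Hypothetical helper: print lines that occur in exactly one of the two files.
# Assumes neither file contains the same line twice.
only_in_one() {
    sort "$1" "$2" | uniq -u
}

# Usage: only_in_one file1 file2 > unique
```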
    
  • 2020-11-29 20:25

    How about:

    diff file_1 file_2 | grep '^>' | cut -c 3-
    

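    This prints the entries in file_2 that are not in file_1. For the opposite result, replace '>' with '<'. 'cut' removes the first two characters of each line, which 'diff' adds and which are not part of the original content.

    The files don't need to be sorted, but 'diff' compares lines in sequence, so the result is a true set difference only when the shared entries appear in the same relative order in both files.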
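
Because diff compares line sequences rather than sets, one way to make the result independent of line order is to sort both inputs first. The following is only a sketch under that assumption: the helper name `only_in_second` is hypothetical, and repeated lines are still counted individually by diff.

```shell
# Hypothetical helper: lines of $2 that are not in $1, regardless of line order.
only_in_second() {
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 1
    sort "$1" > "$tmp1"
    sort "$2" > "$tmp2"
    # diff marks lines present only in the second input with "> ".
    diff "$tmp1" "$tmp2" | grep '^>' | cut -c 3-
    rm -f "$tmp1" "$tmp2"
}

# Usage: only_in_second file_1 file_2
```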

  • 2020-11-29 20:28

    If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates. It may be a good starting point to look at that.

    However, I'd encourage you to use Perl or Python for this. Basically, the flow of the program would be:

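    def find_unique_values(file1, file2):
        # Read each file into a list of lines, without trailing newlines.
        with open(file1) as f:
            contents1 = f.read().splitlines()
        with open(file2) as f:
            contents2 = f.read().splitlines()
        # Print every value from file2 that does not occur in file1.
        for value2 in contents2:
            found = False
            for value1 in contents1:
                if value2 == value1:
                    found = True
            if not found:
                print(value2)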
    

    This isn't the most elegant way of doing this, since it has O(n^2) time complexity (every value in file2 is compared against every value in file1), but it will do the job.

  • 2020-11-29 20:32
    $ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
    6
    7
    

    Explanation of how the code works:

    • If we're working on file1, track each line of text we see.
    • If we're working on file2, and have not seen the line text, then print it.

    Explanation of details:

    • FNR is the current file's record number
    • NR is the current overall record number from all input files
    • FNR==NR is true only when we are reading file1 (assuming file1 is not empty)
    • $0 is the current line of text
    • a[$0] is a hash with the key set to the current line of text
    • a[$0]++ tracks that we've seen the current line of text
    • !($0 in a) is true only when we have not seen the line of text
    • Print the line of text if the above pattern is true; this is the default awk behavior when no explicit action is given
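
The same one-liner can be written out as a commented script, which may make the two phases easier to follow. This is only a sketch: the temporary directory and the contents of the sample files are invented for illustration and are not taken from the question.

```shell
# Create two small sample files (contents are illustrative assumptions).
dir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 > "$dir/file1"
printf '%s\n' 6 7 1 2 3 4 > "$dir/file2"

# Collect the lines of file2 that never occur in file1.
result=$(awk '
    # First file: FNR equals NR, so remember the line and skip to the next one.
    FNR == NR { seen[$0]++; next }
    # Second file: a pattern without an action prints matching lines.
    !($0 in seen)
' "$dir/file1" "$dir/file2")

printf '%s\n' "$result"    # prints 6 and 7, one per line
rm -r "$dir"
```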
  • 2020-11-29 20:32

    I was wondering which of the following solutions was the "fastest" for "larger" files:

    awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
    awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2          # awk2 by ghostdog74
    comm -13 <(sort file1) <(sort file2)
    join -v 2 <(sort file1) <(sort file2)
    grep -v -F -x -f file1 file2
    

    Results of my benchmarks in short:

    • Do not use grep -Fxf; it's much slower (2-4 times in my tests).
    • comm is slightly faster than join.
    • If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (Of course, awk1 + awk2 do not assume sorted files.)
    • awk1 + awk2 supposedly use more RAM and less CPU. Real run times are lower for comm, probably because it uses more threads. CPU times are lower for awk1 + awk2.

    For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was

    # Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
    $ wc file1 file2
      321599   321599  8098710 file1
      321603   321603  8098794 file2
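
For anyone repeating the tests, a first step might be to check that all candidates agree on the same input before comparing run times. The sketch below does only that. The generated data, the line count `n`, and the file names are assumptions and do not reproduce the files measured above; `n` should be raised considerably for meaningful timings.

```shell
# Sketch of a repeatable test setup; sizes and file names are assumptions.
export LC_ALL=C
n=20000
dir=$(mktemp -d)
seq 1 "$n" > "$dir/file1"                          # values 1..n
seq $((n / 2 + 1)) $((n + n / 2)) > "$dir/file2"   # overlaps the upper half of file1
sort "$dir/file1" > "$dir/sorted1"                 # comm and join need sorted input
sort "$dir/file2" > "$dir/sorted2"

# Each candidate prints the lines of file2 that are not in file1.
awk 'FNR==NR{a[$0]=1;next}!($0 in a)' "$dir/file1" "$dir/file2" | sort > "$dir/out.awk"
comm -13 "$dir/sorted1" "$dir/sorted2" > "$dir/out.comm"
join -v 2 "$dir/sorted1" "$dir/sorted2" > "$dir/out.join"
grep -v -F -x -f "$dir/file1" "$dir/file2" | sort > "$dir/out.grep"

# All candidates should agree before their run times are compared.
for name in comm join grep; do
    cmp -s "$dir/out.awk" "$dir/out.$name" && echo "$name matches awk"
done
```

In Bash, each candidate line can then be prefixed with the `time` keyword to get real, user and sys figures like the ones reported in this answer. The temporary directory is left in place for inspection and can be removed with `rm -r "$dir"`.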
    

    Typical results of the fastest runs:

    awk2: real 0m1.145s  user 0m1.088s  sys 0m0.056s  user+sys 1.144
    awk1: real 0m1.369s  user 0m1.324s  sys 0m0.044s  user+sys 1.368
    comm: real 0m0.980s  user 0m1.608s  sys 0m0.184s  user+sys 1.792
    join: real 0m1.080s  user 0m1.756s  sys 0m0.140s  user+sys 1.896
    grep: real 0m4.005s  user 0m3.844s  sys 0m0.160s  user+sys 4.004
    

    BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:

    awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
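
One difference between the two membership tests is worth knowing regardless of speed: in awk, merely referencing `a[$0]` creates that element if it does not exist yet, whereas `$0 in a` only looks it up. Whether this explains the timing difference is an assumption, but the side effect itself can be shown with a small sketch (the input values are made up):

```shell
# Count array elements after testing three unseen keys with each idiom.
with_in=$(printf '%s\n' x y z | awk '
    { if (!($0 in a)) miss++ }                 # lookup only, creates nothing
    END { n = 0; for (k in a) n++; print n }')

with_ref=$(printf '%s\n' x y z | awk '
    { if (!a[$0]) miss++ }                     # the reference creates a[$0]
    END { n = 0; for (k in a) n++; print n }')

echo "in: $with_in, reference: $with_ref"      # prints "in: 0, reference: 3"
```

With `!a[$0]`, every unmatched line of file2 therefore ends up stored in the array as well, so the array grows while file2 is read.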
    
  • 2020-11-29 20:40

    Here's another awk solution:

    $ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
    6
    7
    