extracting unique values between 2 sets/files

粉色の甜心 2020-11-29 19:44

Working in a Linux shell environment, how can I accomplish the following?

text file 1 contains:

1
2
3
4
5

text file 2 contains:



        
8 answers
  •  野性不改
    2020-11-29 20:32

    I was wondering which of the following solutions was the "fastest" for "larger" files:

    awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
    awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2          # awk2 by ghostdog74
    comm -13 <(sort file1) <(sort file2)
    join -v 2 <(sort file1) <(sort file2)
    grep -v -F -x -f file1 file2
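
To check that the five commands agree, here is a minimal sketch on two tiny files. The file contents, the temporary directory, and the `out_*` and `show` names are assumptions made for illustration, not data from the question. The sorted copies are written to intermediate files only so that the script also runs in a plain POSIX shell; in bash, `<(sort file1)` does the same without them.

```shell
#!/bin/sh
# Hypothetical sample data, chosen for illustration only.
export LC_ALL=C
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 1 2 3 4 5   > file1
printf '%s\n' 6 7 1 2 3 4 > file2

# comm and join need sorted input.
sort file1 > file1.sorted
sort file2 > file2.sorted

# Each command prints the lines of file2 that do not occur in file1.
out_awk1=$(awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2)
out_awk2=$(awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2)
out_comm=$(comm -13 file1.sorted file2.sorted)
out_join=$(join -v 2 file1.sorted file2.sorted)
out_grep=$(grep -v -F -x -f file1 file2)

# Print each result on one line: "<label>: <lines joined by spaces>".
show() { printf '%s: %s\n' "$1" "$(printf '%s' "$2" | tr '\n' ' ')"; }
show awk1 "$out_awk1"
show awk2 "$out_awk2"
show comm "$out_comm"
show join "$out_join"
show grep "$out_grep"
```

Note that awk and grep keep the line order of file2, while comm and join emit sorted order, so on other data the outputs may need a `sort` before they can be compared.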
    

    Results of my benchmarks in short:

    • Do not use grep -Fxf; it is much slower (2-4 times in my tests).
    • comm is slightly faster than join.
    • If file1 and file2 are already sorted, comm and join are much faster than awk1 and awk2. (Of course, the awk solutions do not require sorted files.)
    • awk1 and awk2 supposedly use more RAM and less CPU. Real run times are lower for comm, probably because its work is spread over several processes (user+sys exceeds real for comm and join). CPU times are lower for awk1 and awk2.

    For the sake of brevity I omit the full details. Anyone interested can contact me or simply repeat the tests. Roughly, the setup was:

    # Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
    $ wc file1 file2
      321599   321599  8098710 file1
      321603   321603  8098794 file2
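
For anyone repeating the tests, a harness along the following lines may serve as a starting point. It is only a sketch under stated assumptions: the data is synthetic (pseudo-random lines generated with awk, not the original files), and the `bench` helper, the `N` variable, and the `out.<label>` file names are hypothetical. The line count defaults to a small value so the script finishes quickly; set `N=321599` to approximate the file sizes above.

```shell
#!/bin/sh
# Sketch of a timing harness on synthetic data (not the original test files).
export LC_ALL=C
n=${N:-20000}                    # lines in file1; small default for a quick run
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# file1: n unique pseudo-random lines. file2: the same lines plus 4 new ones.
awk -v n="$n" 'BEGIN {
    srand(1)
    for (i = 1; i <= n; i++)
        printf "%09d-%d\n", int(rand() * 1000000000), i
}' > file1
{ printf 'extra-%s\n' 1 2; cat file1; printf 'extra-%s\n' 3 4; } > file2

# Run one candidate in a fresh bash (needed for the `time` keyword and for
# <(...)), saving its output to out.<label> for comparison afterwards.
bench() {
    echo "== $1" >&2
    bash -c "time $2 > out.$1"
}

bench awk1 "awk 'FNR==NR{a[\$0]++}FNR!=NR && !a[\$0]{print}' file1 file2"
bench awk2 "awk 'FNR==NR{a[\$0]++;next}!(\$0 in a)' file1 file2"
bench comm "comm -13 <(sort file1) <(sort file2)"
bench join "join -v 2 <(sort file1) <(sort file2)"
bench grep "grep -v -F -x -f file1 file2"

# Every candidate should have found the same 4 extra lines.
wc -l out.awk1 out.awk2 out.comm out.join out.grep
```

A single run says little; repeating each command several times and keeping the fastest run, as in the results below, gives steadier numbers.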
    

    Typical results of the fastest runs:

    awk2: real 0m1.145s  user 0m1.088s  sys 0m0.056s  user+sys 1.144
    awk1: real 0m1.369s  user 0m1.324s  sys 0m0.044s  user+sys 1.368
    comm: real 0m0.980s  user 0m1.608s  sys 0m0.184s  user+sys 1.792
    join: real 0m1.080s  user 0m1.756s  sys 0m0.140s  user+sys 1.896
    grep: real 0m4.005s  user 0m3.844s  sys 0m0.160s  user+sys 4.004
    

    BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:

    awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
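
Spelled out with comments, the suggested one-liner reads roughly as follows. The two sample files and the `result` variable are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical sample data, chosen for illustration only.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 1 2 3 4 5   > file1
printf '%s\n' 6 7 1 2 3 4 > file2

result=$(awk '
    # FNR restarts at 1 for every input file while NR keeps counting, so
    # FNR == NR holds only while the first file (file1) is being read.
    FNR == NR { a[$0] = 1; next }   # remember the line, skip the rule below

    # Reached only for file2: print lines that were never stored.
    # "$0 in a" tests for the key without creating an array entry.
    !($0 in a)
' file1 file2)

printf '%s\n' "$result"
```

One caveat of the `FNR==NR` idiom: if file1 is empty, the condition also holds while file2 is read, so nothing is printed.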
    
