find common elements in >2 files

Try the following solution, generalized for N files. It saves the data of the first file in a hash with a value of 1, and for each hit from the next files that value is incremented. At the end I check whether the value of each key equals the number of files processed and print only those that match.

awk '
    FNR == NR { arr[$1,$2] = 1; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            split( key, key_arr, SUBSEP )
            printf "%s %s\n", key_arr[1], key_arr[2] 
        } 
    }
' file{1..3}

It yields:

"xxx" 0
"aba" 0

EDIT to add a version that prints the whole line (see comments). I've added another array with the same key where I save the line, and I also use it in the printf call. I've left the old code commented out.

awk '
    ##FNR == NR { arr[$1,$2] = 1; next }
    FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            ##split( key, key_arr, SUBSEP )
            ##printf "%s %s\n", key_arr[1], key_arr[2] 
            printf "%s\n", line[ key ] 
        } 
    }
' file{1..3}

NEW EDIT (see comments) to add a version that handles multiple lines with the same key. Basically I join all entries instead of saving only one, replacing line[$1,$2] = $0 with line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. I also add a found array, reset for each file, so that duplicate keys within one file only count once towards the total. At printing time I do the reverse: split on the separator (the SUBSEP variable) and print each entry.

awk '
    FNR == NR { 
        arr[$1,$2] = 1
        line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
        next
    }
    FNR == 1 { delete found }
    { if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
    END { 
        num_files = ARGC -1 
        for ( key in arr ) {
            if ( arr[key] < num_files ) { continue }
            split( line[ key ], line_arr, SUBSEP )
            for ( i = 1; i <= length( line_arr ); i++ ) { 
                printf "%s\n", line_arr[ i ]
            } 
        } 
    }
' file{1..3}

With new data edited in question, it yields:

"xxx" 0 0
"aba" 0 0 
"aba" 0 0 1

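A portability note not in the original answer: deleting a whole array with delete found and calling length() on an array are common extensions (GNU awk supports both) but are not guaranteed by POSIX awk. A sketch of the same two spots written portably:

FNR == 1 { split( "", found ) }    # portable way to empty an array

# ... and in the END block, let split() report the entry count:
num_entries = split( line[ key ], line_arr, SUBSEP )
for ( i = 1; i <= num_entries; i++ ) { printf "%s\n", line_arr[ i ] }
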
This Python script will list the common lines among all files:

import sys
from functools import reduce  # reduce is not a builtin in Python 3

# Build one set per file, holding the first two fields of every line.
sets = []
for fname in sys.argv[1:]:
  keys = set()
  for line in open(fname):
    keys.add(" ".join(line.split()[0:2]))
  sets.append(keys)

# Intersect all per-file sets to get the fields common to every file.
commonFields = reduce(lambda s1, s2: s1 & s2, sets)

for fname in sys.argv[1:]:
  print("Common lines in", fname)
  for line in open(fname):
    # Substring test: prints the line if any common key appears in it.
    for fields in commonFields:
      if fields in line:
        print(line, end="")
        break

Usage: python3 script.py file1 file2 file3 ...

For three files, all you need is:

awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file2.txt file3.txt

The FNR==NR block is true only for the first file in the arguments list. The next statement in that block forces a skip over the remainder of the code, so ($1,$2) in a is evaluated for every file in the arguments list except the first, printing each of their lines whose first two fields appear in the first file. To process more files this way, all you need to do is list them. A minimal demonstration follows.
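
A minimal sketch with hypothetical sample data (the contents below are invented for illustration; note that matching lines from every subsequent file are printed, duplicates included):

printf '"xxx" 0\n"aba" 0\n"foo" 1\n' > file1.txt
printf '"xxx" 0\n"aba" 0\n' > file2.txt
printf '"aba" 0\n"xxx" 0\n' > file3.txt
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file2.txt file3.txt
# prints:
# "xxx" 0
# "aba" 0
# "aba" 0
# "xxx" 0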


If you need more powerful globbing on the command line, use extglob. You can turn it on with shopt -s extglob, and turn it off with shopt -u extglob. For example:

awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt !(file1.txt)
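
To make that concrete, the shell session would look like this (a sketch; !(file1.txt) matches every other name in the current directory, so make sure only the data files are there):

shopt -s extglob
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt !(file1.txt)
shopt -u extglob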

If you have hard-to-find files, use find. For example:

awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt $(find /path/to/files -type f -name "*[23].txt")

I assume you're looking for a shorthand for a range of 'N' files; bash brace expansion does that. For example:

awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file{2,3}.txt