Intersection of files

Question


I have two large files (27k lines and 450k lines). They look sort of like:

File1:
1 2 A 5
3 2 B 7
6 3 C 8
...

File2:
4 2 C 5
7 2 B 7
6 8 B 8
7 7 F 9
... 

I want the lines from both files whose 3rd column value appears in both files (note that the lines with A and F were excluded):

OUTPUT:
3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8

What's the best way?


Answer 1:


awk '{print $3}' file1 | sort | uniq > file1col3
awk '{print $3}' file2 | sort | uniq > file2col3
grep -Fx -f file1col3 file2col3 | awk '{print "\\w+ \\w+ " $1 " \\w+"}' > col3regexp
egrep -xh -f col3regexp file1 file2

This grabs all the unique column-3 values in the two files, intersects them (using grep -F), prints a set of regular expressions that will match the lines you want, then uses egrep to extract those lines from the two files.
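
For the sample data above, col3regexp ends up holding one pattern per shared value:

\w+ \w+ B \w+
\w+ \w+ C \w+

Note that \w is a GNU grep extension; on other systems you could substitute [[:alnum:]_]+ for it.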




Answer 2:


First we sort the files on the third field (both comm and join need sorted input):

sort -k 3 file1 > file1.sorted
sort -k 3 file2 > file2.sorted

Then we get the common values of the 3rd field using comm:

comm -12 <(cut -d " " -f 3 file1.sorted | uniq) <(cut -d " " -f 3 file2.sorted | uniq) > common_values.field

Now we can join each sorted file against the common values:

join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined

The output is formatted so that we get the same field order as the one used in the files. Only standard Unix tools are used: sort, comm, cut, uniq and join. The <( ) process substitution works in bash; for other shells you might use temp files instead, as sketched below.
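
A minimal sketch of that temp-file variant (the intermediate file names are illustrative):

cut -d " " -f 3 file1.sorted | uniq > file1.field3
cut -d " " -f 3 file2.sorted | uniq > file2.field3
comm -12 file1.field3 file2.field3 > common_values.field
rm file1.field3 file2.field3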




Answer 3:


Here's an option using grep, sed and cut.

Extract column 3:

cut -d' ' -f3 file1 > f1c
cut -d' ' -f3 file2 > f2c

Find matching lines in file1:

grep -nxFf f2c f1c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file1 > out

Find matching lines in file2:

grep -nxFf f1c f2c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file2 >> out

Output:

3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8
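
To see why this works, here are the intermediate stages for the sample data. For file1, grep -nxFf f2c f1c prints the matching line numbers:

2:B
3:C

and after cut and sed this becomes a tiny sed script, which the final sed -n runs against file1 to print exactly those lines:

2p
3p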

Update

If you have asymmetric data files and the smaller one fits into memory, this one-pass awk solution would be pretty efficient:

parse.awk

# While reading file1 (FNR == NR), remember each line by its 3rd column
FNR == NR {
  a[$3] = $0
  p[$3] = 1
  next
}

# Reading file2: print the current file2 line if its $3 was seen in file1
a[$3]

# Also print the remembered file1 line, then delete the key so it prints once
p[$3] {
  print a[$3]
  delete p[$3]
}

Run it like this:

awk -f parse.awk file1 file2

Where file1 is the smaller of the two.

Explanation

  • The FNR == NR block reads file1 into two hashes.
  • The a[$3] pattern prints the current file2 line if $3 is a key in a.
  • The p[$3] block prints the stored file1 line if $3 is a key in p, then deletes the key so it is printed only once.
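
One caveat: a[$3] keeps only the last file1 line seen for each key, so repeated column-3 values within file1 would lose lines. A sketch of a modified first block that accumulates them instead (not part of the original answer):

FNR == NR {
  # append rather than overwrite, so every file1 line per key survives
  a[$3] = ($3 in a) ? a[$3] ORS $0 : $0
  p[$3] = 1
  next
}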



Answer 4:


First obtain the common values from the third column. Then filter the lines from both files that have a matching third column.

If the columns are delimited by a single character, you can use cut to extract one column. For columns that can be separated by an arbitrary amount of whitespace, use awk. One way to obtain the common column 3 values is to extract the columns, sort them and call comm. Using bash/ksh/zsh process substitutions:

comm -12 <(awk '{print $3}' file1 | sort -u) <(awk '{print $3}' file2 | sort -u)

Now turn these into grep patterns, and filter.

comm -12 <(awk '{print $3}' file1 | sort -u) <(awk '{print $3}' file2 | sort -u) |
sed -e 's/[][.\|?*+^$]/\\&/g' \
    -e 's/.*/^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+&[[:space:]]/' |
grep -E -h -f - file1 file2
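
As an illustration, the shared value B comes out of the sed stage as the pattern below: the first sed expression escapes any regex metacharacters in the value, and the second wraps it so it can only match as the third whitespace-delimited field of a line.

^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+B[[:space:]]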

The method above should work reasonably well with huge files. But at 500k lines, you don't have huge files. Those files should fit comfortably in memory, and a simple Perl solution will be fine. Load both files, extract the column values, then print the matching lines.

perl -n -e '
    push @lines, $_;
    $c = (split)[2];
    $seen{$c}{$ARGV} = 1;
END {
    foreach (@lines) {
        $c = (split)[2];
        print if keys %{$seen{$c}} == 2;
    }
}' file1 file2
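
Two details worth noting: under -n, $ARGV holds the name of the file currently being read, so %{$seen{$c}} ends up with one key per file in which the value occurred; keys in numeric context yields that count, and a line is printed only when the count is 2, i.e. the value was seen in both files.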


Source: https://stackoverflow.com/questions/12443110/intersection-of-files
