Print lines in one file matching patterns in another file


I have a file with more than 40,000 lines (file1) and I want to extract the lines matching patterns in file2 (about 6,000 lines). I use grep like this, but it is very slow:

5 Answers

生来不讨喜

2020-11-29 07:12

Just for fun, here's a Perl version:

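```
#!/usr/bin/perl
use strict;
use warnings;
my %patterns;
my $srch;

# Open file and get patterns to search for
open(my $fh2, "<", "file2") || die "ERROR: Could not open file2: $!";
while (<$fh2>)
{
   chomp;    # strip only the trailing newline (chop removes any last character)
   $patterns{$_} = 1;
}
close($fh2);

# Now read data file and print lines whose second field is a known pattern
open(my $fh1, "<", "file1") || die "ERROR: Could not open file1: $!";
while (<$fh1>)
{
   (undef, $srch) = split;
   # Skip lines with fewer than two fields instead of warning on undef
   print if defined $srch && exists $patterns{$srch};
}
close($fh1);
```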

Here are some timings, using a 60,000 line file1 and 6,000 line file2 per Ed's file creation method:

```
$ time awk 'NR==FNR{pats[$0]; next} $2 in pats' file2 file1 > out

real    0m0.202s
user    0m0.197s
sys     0m0.005s

$ time ./go.pl > out2

real    0m0.083s
user    0m0.079s
sys     0m0.004s
```

执笔经年

2020-11-29 07:27

Just for the sake of learning: I was solving the same problem and came up with various solutions (including read $line loops, etc.). When I got to the grep one-liner above, I still got the wrong output. Then I realized my pattern file had two trailing blank lines, so grep picked up every line from my database. Moral: check your pattern file for trailing spaces and blank lines. I also ran the command on a much larger dataset with several hundred patterns, and it finished too quickly for time to measure.
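
One way to guard against that pitfall is to clean the pattern file before handing it to grep. This is only a minimal sketch, assuming the patterns are plain strings, one per line; the helper name clean_patterns and the file name file2.clean are hypothetical and not part of the original one-liner:

```shell
#!/bin/sh
# clean_patterns (hypothetical helper): print a pattern file with trailing
# whitespace removed from every line (this includes a trailing carriage
# return) and with lines that end up empty dropped.
clean_patterns() {
    sed -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Usage sketch (file names are assumptions):
#   clean_patterns file2 > file2.clean
#   grep -Fwf file2.clean file1 > out
```

Writing the cleaned patterns to a separate file keeps the original pattern file untouched, so the two can be compared if the output still looks wrong.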

不思量自难忘°

2020-11-29 07:30
Here's how to do it in awk:
```
awk 'NR==FNR{pats[$0]; next} $2 in pats' File2 File1
```
Using a 60,000-line File1 (your File1 repeated 8000 times) and a 6,000-line File2 (your File2 repeated 1200 times):
```
$ time grep -Fwf File2 File1 > ou2

real    0m0.094s
user    0m0.031s
sys     0m0.062s

$ time awk 'NR==FNR{pats[$0]; next} $2 in pats' File2 File1 > ou1

real    0m0.094s
user    0m0.015s
sys     0m0.077s

$ diff ou1 ou2
```
That is, awk is about as fast as grep here, and the empty diff shows that the two outputs are identical. One thing to note, though, is that the awk solution lets you pick a specific field to match on, so a string from File2 that shows up in some other field of File1 does not produce a false match. It also compares a whole field at a time, so if your target strings have various lengths, "scign000003" will not match "scign0000031" (the -w option of grep gives similar protection for that).
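
To make the difference concrete, here is a small sketch on made-up data. The three sample lines and the temporary directory are assumptions for illustration only; the real File1 layout may differ, and the only thing taken from the answer is that the key sits in the second field:

```shell
#!/bin/sh
# Hypothetical sample data: the key of interest is the second
# whitespace-separated field of each line.
dir=$(mktemp -d)
printf '%s\n' 'chr1 scign000003 10' \
              'chr2 scign0000031 20' \
              'chr3 scign000007 scign000003' > "$dir/File1"
printf '%s\n' 'scign000003' > "$dir/File2"

# grep -w matches the whole word anywhere on the line, so the third line
# (where the string appears in another field) is printed as well.
grep_out=$(grep -Fwf "$dir/File2" "$dir/File1")

# awk compares only field 2 against the stored patterns.
awk_out=$(awk 'NR==FNR{pats[$0]; next} $2 in pats' "$dir/File2" "$dir/File1")

echo "grep:"; echo "$grep_out"
echo "awk:";  echo "$awk_out"
rm -rf "$dir"
```

On this sample, grep prints the first and third lines while awk prints only the first. Neither command prints the "scign0000031" line.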

For completeness, here's the timing for the other awk solution posted elsewhere in this thread:
```
$ time awk 'BEGIN{i=0}FNR==NR{a[i++]=$1;next}{for(j=0;j<i;j++)if(index($0,a[j]))print $0}' File2 File1 > ou3

real    3m34.110s
user    3m30.850s
sys     0m1.263s
```
And here's the timing I get for the Perl script Mark posted:
```
$ time ./go.pl > out2

real    0m0.203s
user    0m0.124s
sys     0m0.062s
```
孤街浪徒

2020-11-29 07:32

Try grep -Fwf file2 file1 > out

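The -F option tells grep to treat each pattern as a fixed string rather than a regular expression, so it should be faster because the regex engine is not involved. The -w option restricts matches to whole words, and -f reads the patterns from file2.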
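
Besides speed, fixed-string matching also changes what counts as a match when a pattern contains regex metacharacters. A minimal sketch, using two made-up input lines that are not from the question's data:

```shell
#!/bin/sh
# Without -F, "." is a regex metacharacter that matches any character,
# so the pattern a.c matches both "abc" and "a.c".
regex_hits=$(printf 'abc\na.c\n' | grep 'a.c')

# With -F, a.c is a fixed string, so only the literal text "a.c" matches.
fixed_hits=$(printf 'abc\na.c\n' | grep -F 'a.c')

echo "regex match:"; echo "$regex_hits"
echo "fixed match:"; echo "$fixed_hits"
```

If the strings in file2 can contain characters such as "." or "*", -F is therefore also the safer choice, not just the faster one.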

执念已碎

2020-11-29 07:34
You could try this awk script:
```
awk 'BEGIN{i=0}
FNR==NR { a[i++]=$1; next }
{ for(j=0;j<i;j++)
    if(index($0,a[j]))
        {print $0;break}
}' file2 file1
```
The FNR==NR condition is true only while awk is reading the first input file (file2), so the block in curly braces that follows it runs only for that file. It saves the first field of each line, i.e. every word you are looking for, in the array a[]. The second block in curly braces applies to the second file (file1): as each line is read, it is compared with every element of a[] using index(), which is a substring test, and the line is printed as soon as one element is found in it. That's all, folks!
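
To see why the FNR==NR test singles out the first file, it can help to print both counters. The two tiny input files below are hypothetical and exist only for this illustration:

```shell
#!/bin/sh
# Hypothetical two-line inputs, created only for this illustration.
dir=$(mktemp -d)
printf 'p1\np2\n' > "$dir/file2"
printf 'x\ny\n'   > "$dir/file1"

# NR counts records across all input files; FNR restarts at 1 for each file.
# The two are equal only while awk is still reading the first file.
nr_trace=$(awk '{ which = (NR == FNR) ? "first" : "second"
                  print which, NR, FNR }' "$dir/file2" "$dir/file1")
echo "$nr_trace"
rm -rf "$dir"
```

One caveat with this idiom: if the first file is empty, NR and FNR are also equal while the second file is read, so the first block would run on the wrong file.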