grep a large list against a large file

天命终不由人 2020-12-07 17:09

I am currently trying to grep a large list of IDs (~5,000) against an even larger CSV file (3,000,000 lines).

I want all the CSV lines that contain an ID from the list.

4 Answers
  • 2020-12-07 17:52

    You may get a significant search speedup with ugrep when matching the strings in the_ids.txt against your huge.csv file:

    ugrep -F -f the_ids.txt huge.csv
    

    This works with GNU grep too, but I expect ugrep to run several times faster.
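
    If the ids could also show up inside longer tokens (for example 11 inside 110), word matching may help; a minimal sketch, assuming the ids appear as whole words in the CSV and that the result should go to a file (the output name is just an example):

    ugrep -F -w -f the_ids.txt huge.csv > matching_lines.csv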

  • 2020-12-07 17:53

    Try

    grep -f the_ids.txt huge.csv
    

    Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.

       -F, --fixed-strings
              Interpret PATTERN as a  list  of  fixed  strings,  separated  by
              newlines,  any  of  which is to be matched.  (-F is specified by
              POSIX.)
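
    Combining both suggestions, a minimal sketch for this case (the output file name is just an example):

    grep -F -f the_ids.txt huge.csv > matching_lines.csv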
    
  • 2020-12-07 17:58

    Use grep -f for this:

    grep -f the_ids.txt huge.csv > output_file
    

    From man grep:

    -f FILE, --file=FILE

    Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing. (-f is specified by POSIX.)

    If you provide some sample input maybe we can even improve the grep condition a little more.

    Test

    $ cat ids
    11
    23
    55
    $ cat huge.csv 
    hello this is 11 but
    nothing else here
    and here 23
    bye
    
    $ grep -f ids huge.csv 
    hello this is 11 but
    and here 23
    
  • 2020-12-07 18:05

    grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousand lines and hence isn't the best choice for such a situation. Even while using grep -f, we need to keep a few things in mind:

    • use -x option if there is a need to match the entire line in the second file
    • use -F if the first file has strings, not patterns
    • use -w to prevent partial matches while not using the -x option (combined in the sketch below)
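
    For example, a minimal sketch combining these flags for fixed-string, whole-word matching (swap -w for -x to require whole-line matches):

    grep -F -w -f filter.txt data.txt > matching.txt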

    This post has a great discussion on this topic (grep -f on large files):

    • Fastest way to find lines of a file from another larger file in Bash

    And this post talks about grep -vf:

    • grep -vf too slow with large files

    In summary, the best way to handle grep -f on large files is:

    Matching entire line:

    awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
    

    Matching a particular field in the second file (using ',' delimiter and field 2 in this example):

    awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt
    

    and for grep -vf:

    Matching entire line:

    awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt
    

    Matching a particular field in the second file (using ',' delimiter and field 2 in this example):

    awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
    
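    Applied to the files from the question, a minimal sketch, assuming the id sits in the first comma-separated field of huge.csv (adjust $1 to whichever column actually holds it). FNR==NR is true only while the first file is being read, so the ids are loaded as keys of the ids array; a line of huge.csv is then printed when its first field is one of those keys.

    awk -F, 'FNR==NR {ids[$0]; next} $1 in ids' the_ids.txt huge.csv > matching_lines.csv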