Question
I have a large number of large log files (each is around 200MB, and I have 200GB of data in total).
Every 10 minutes, the server writes about 10K parameters (with a timestamp) to the log file. Out of each 10K parameters, I want to extract 100 to a new file.
First I used grep with 1 parameter; setting LC_ALL=C made it a little faster, and switching to fgrep made it slightly faster again. Then I used parallel
parallel -j 2 --pipe --block 20M
and finally, for every 200MB, I was able to extract 1 parameter in 5 seconds.
BUT... when I put multiple parameters into one grep
parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt
the time for the grep operation increased linearly (it now takes several minutes to grep 1 file). (Note that I had to use egrep for the | alternation; somehow plain grep didn't like it.)
Is there a faster/better way to solve this problem?
Note that I don't need regex, because the patterns I am looking for are fixed. I just want to extract the lines that include a particular string.
Answer 1:
In response to the comments above, I ran another test. I took my output file from the md5deep -rZ command (size: 319MB) and randomly selected 100 md5 checksums (each 32 chars long).
For
time egrep '100|fixed|strings' md5 >/dev/null
the time is
real 0m16.888s
user 0m16.714s
sys 0m0.172s
for the
time fgrep -f 100_lines_patt_file md5 >/dev/null
the time is
real 0m1.379s
user 0m1.220s
sys 0m0.158s
Roughly 12 times faster than egrep (16.9s vs. 1.4s real time).
So if you only get a 0.3 sec improvement between egrep and fgrep, IMHO that means:
- your IO is too slow
The run time of egrep is then limited not by processor or memory but by IO, and (IMHO) that is why you don't see any speed improvement with fgrep.
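For the question's setup, the fgrep approach above could look roughly like the sketch below, assuming the 100 parameter names are kept one per line in a pattern file. The file names (example_params.txt, example_log.txt, example_extracted.txt) and the sample log lines are made up for illustration; grep -F is the current spelling of fgrep.

```shell
# Hypothetical pattern file: one fixed string per line
# (100 lines in the real case, only 2 here).
printf '%s\n' 'param1' 'param42' > example_params.txt

# Hypothetical log excerpt standing in for one 200MB log file.
printf '%s\n' \
  '2013-07-04 10:00 param1=3.14' \
  '2013-07-04 10:00 param7=0' \
  '2013-07-04 10:00 param42=ok' > example_log.txt

# -F treats every pattern as a literal string (no regex engine),
# -f reads the patterns from a file, and LC_ALL=C avoids
# multibyte locale handling.
LC_ALL=C grep -F -f example_params.txt example_log.txt > example_extracted.txt

cat example_extracted.txt
```

Keep in mind these are substring matches, so a pattern like param1 would also match a line containing param10. Including the delimiter in the pattern (for example param1=) or adding -w avoids that.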
Answer 2:
Interestingly, compressing the log files into .gz format and using zgrep -E reduced the time dramatically. It also didn't matter whether I searched for 1 pattern or multiple patterns in a single zgrep command; it took about 1 second per 200MB file either way.
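A rough sketch of that workflow, using a tiny stand-in log and a made-up two-parameter pattern (all file names here are hypothetical):

```shell
# Hypothetical log excerpt standing in for one 200MB log file.
printf '%s\n' \
  '2013-07-04 10:00 param1=3.14' \
  '2013-07-04 10:00 param7=0' \
  '2013-07-04 10:00 param42=ok' > example_log2.txt

# Compress once; -c writes to stdout, so the original file stays in place.
gzip -c example_log2.txt > example_log2.txt.gz

# zgrep decompresses on the fly and hands the data to grep;
# -E enables the | alternation, as with egrep.
zgrep -E 'param1=|param42=' example_log2.txt.gz > example_zgrep_out.txt

cat example_zgrep_out.txt
```

Less data has to be read from disk this way, which would fit the IO-bound explanation in the first answer, though I have not measured that separately.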
Source: https://stackoverflow.com/questions/17475791/grep-multiple-strings-on-large-files