Question
What is the fastest method for finding the lines in a file that contain any of a given set of strings? I have a file containing the strings to search for. This small file (smallF) contains about 50,000 lines and looks like:
stringToSearch1
stringToSearch2
stringToSearch3
I have to search for all of these strings in a larger file (about 100 million lines). If any line in this larger file contains a search string, the line is printed.
The best method I have come up with so far is:
grep -F -f smallF largeF
But this is not very fast: with just 100 search strings in smallF it takes about 4 minutes, so with over 50,000 search strings it will take a very long time.
Is there a more efficient method?
Answer 1:
I once noticed that using -E or multiple -e parameters is faster than using -f. Note that this might not be applicable to your problem, as you are searching for 50,000 strings in a larger file. However, I wanted to show you what can be done and what might be worth testing:
Here is what I noticed in detail:
I have a 1.2 GB file filled with random strings:
>ls -has | grep string
1,2G strings.txt
>head strings.txt
Mfzd0sf7RA664UVrBHK44cSQpLRKT6J0
Uk218A8GKRdAVOZLIykVc0b2RH1ayfAy
BmuCCPJaQGhFTIutGpVG86tlanW8c9Pa
etrulbGONKT3pact1SHg2ipcCr7TZ9jc
.....
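(For anyone wanting to reproduce this setup: the answer does not say how the file was generated, but a similar test file of random 32-character lines could be built with something like the following, assuming a GNU/Linux userland; 40 million lines of 32 characters come to roughly 1.3 GB.)

tr -dc 'A-Za-z0-9' < /dev/urandom | fold -w 32 | head -n 40000000 > strings.txt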
Now I want to search for strings "ab", "cd" and "ef" using different grep approaches:
Using grep without flags, search one at a time:
grep "ab" strings.txt > m1.out
2,76s user 0,42s system 96% cpu 3,313 totalgrep "cd" strings.txt >> m1.out
2,82s user 0,36s system 95% cpu 3,322 totalgrep "ef" strings.txt >> m1.out
2,78s user 0,36s system 94% cpu 3,360 total
So in total the search takes nearly 10 seconds.
Using grep with the -f flag, with the search strings in search.txt:
>cat search.txt
ab
cd
ef
>grep -F -f search.txt strings.txt > m2.out
31,55s user 0,60s system 99% cpu 32,343 total
For some reason this takes nearly 32 seconds.
Now using multiple search patterns with -E or -e:
grep -E "ab|cd|ef" strings.txt > m3.out
3,80s user 0,36s system 98% cpu 4,220 total
or
grep --color=auto -e "ab" -e "cd" -e "ef" strings.txt > /dev/null
3,86s user 0,38s system 98% cpu 4,323 total
The third method, using -E, took only 4.22 seconds to search through the file.
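To apply this to the original question, the alternation pattern could be built from smallF itself. A minimal sketch, assuming the search strings contain no regex metacharacters such as . or |:

grep -E "$(paste -sd'|' smallF)" largeF

Note that with 50,000 strings the joined pattern will likely exceed the shell's per-argument length limit, so in practice you would have to split smallF into chunks or stay with -f.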
Now let's check whether the results are the same:
cat m1.out | sort | uniq > m1.sort
cat m3.out | sort | uniq > m3.sort
diff m1.sort m3.sort
The diff produces no output, which means the found results are the same.
Maybe you want to give this a try; otherwise I would advise you to look at the thread "Fastest possible grep" (see the comment from Cyrus).
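For reference, the advice commonly given in that thread amounts to forcing the byte-oriented C locale and splitting the work across cores, along these lines (a sketch, not benchmarked here; the second command assumes GNU parallel is installed and uses an arbitrary --block size):

LC_ALL=C grep -F -f smallF largeF
parallel --pipepart -a largeF --block 100M LC_ALL=C grep -F -f smallF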
Answer 2:
You may want to try sift or ag. Sift in particular lists some pretty impressive benchmarks versus grep.
Answer 3:
Note: I realise the following is not a bash-based solution, but given your large search space, a parallel solution is warranted. If your machine has more than one core/processor, you could call the following function in Pythran to parallelize the search:
#!/usr/bin/env python
#pythran export search_in_file(str, str)

def search_in_file(long_file_path, short_file_path):
    # Read everything up front: a file object can only be iterated once,
    # so reusing it for every search string would silently find nothing.
    long_lines = open(long_file_path, "r").readlines()
    patterns = [p.rstrip("\n") for p in open(short_file_path, "r")]
    #omp parallel for schedule(guided)
    for i in range(len(patterns)):
        # Print every line of the large file that contains this pattern.
        for line in long_lines:
            if patterns[i] in line:
                print(line.rstrip("\n"))

if __name__ == "__main__":
    search_in_file("long_file_path", "short_file_path")
Note: Behind the scenes, Pythran takes Python code and attempts to aggressively compile it into very fast C++.
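To actually get the OpenMP parallelism, the module has to be compiled first; a typical invocation, assuming the code above is saved as search_in_file.py, would be:

pythran -fopenmp search_in_file.py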
Source: https://stackoverflow.com/questions/37693682/fast-string-search-in-a-very-large-file