Lets say that I have 10,000 regexes and one string and I want to find out if the string matches any of them and get all the matches. The trivial way to do it would be to jus
Aho-Corasick was the answer for me.
I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.
Main Caveat: The patters to match were all language patters not regex patterns e.g. 'cat'
vs r'\w+'
.
I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.
import ahocorasick
A = ahocorasick.Automaton()
patterns = [
[['cat','dog'],'mammals'],
[['bass','tuna','trout'],'fish'],
[['toad','crocodile'],'amphibians'],
]
for row in patterns:
vals = row[0]
for val in vals:
A.add_word(val, (row[1], val))
A.make_automaton()
_string = 'tom loves lions tigers cats and bass'
def test():
vals = []
for item in A.iter(_string):
vals.append(item)
return vals
Running %timeit test()
on my 2000 categories with about 2-3 traces per category and a _string
length of about 100,000
got me 2.09 ms
vs 631 ms
doing sequential re.search()
315x faster!.