Efficiently querying one string against multiple regexes

感情败类 2020-12-12 17:16

Let's say that I have 10,000 regexes and one string, and I want to find out if the string matches any of them and get all the matches. The trivial way to do it would be to jus

18 answers

北海茫月 (original poster)

2020-12-12 17:52
Aho-Corasick was the answer for me.

I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.

Main caveat: the patterns to match were all literal words, not regex patterns, e.g. 'cat' vs. r'\w+'.
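To see why the caveat matters, here is a minimal pure-Python sketch of the idea behind Aho-Corasick: a trie over literal characters plus failure links, scanned in a single pass over the text. It is an illustration only, not how pyahocorasick is implemented, and the names build_automaton and find_all are made up for this example. Because every transition is keyed on one literal character, there is no place for something like r'\w+'.

```python
from collections import deque

def build_automaton(words):
    """Build a trie with failure links. `words` maps literal word -> payload."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> payloads of all words ending at this state
    for word, payload in words.items():
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(payload)

    # Breadth-first pass: compute failure links and inherit their outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (end_index, payload) for every occurrence of every word."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        # Follow failure links until this character can be consumed.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for payload in out[state]:
            hits.append((i, payload))
    return hits

automaton = build_automaton({
    'cat': ('mammals', 'cat'),
    'bass': ('fish', 'bass'),
})
print(find_all('tom loves lions tigers cats and bass', automaton))
```

The text is read once regardless of how many words were added, which is where the speedup over running one search per pattern comes from.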

I was using Python, so I used pyahocorasick: https://pypi.python.org/pypi/pyahocorasick/.
```
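import ahocorasick

# Each entry: (list of literal words, category label).
patterns = [
    (['cat', 'dog'], 'mammals'),
    (['bass', 'tuna', 'trout'], 'fish'),
    (['toad', 'frog'], 'amphibians'),
]

# Store (category, word) as the payload for every literal word.
A = ahocorasick.Automaton()
for words, category in patterns:
    for word in words:
        A.add_word(word, (category, word))

# Finalize the automaton; required before calling iter().
A.make_automaton()

_string = 'tom loves lions tigers cats and bass'

def test():
    # iter() yields (end_index, payload) for every match in the string.
    return list(A.iter(_string))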
```
Running %timeit test() on my 2000 categories, with about 2-3 traces per category and a _string length of about 100,000, took 2.09 ms, versus 631 ms for sequential re.search() calls: roughly 300x faster.
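For reference, the sequential baseline could look something like the sketch below. The exact baseline code was not posted, so this is an assumption about its shape: one re.search() call per literal word, with re.escape() so each word is matched literally. The function name sequential_search and the sample data are made up for illustration.

```python
import re

# Hypothetical sample data in the same spirit as the example above.
patterns_by_category = {
    'mammals': ['cat', 'dog'],
    'fish': ['bass', 'tuna', 'trout'],
}

def sequential_search(text, patterns_by_category):
    """Scan the whole text once per word: cost grows with the number of words."""
    hits = []
    for category, words in patterns_by_category.items():
        for word in words:
            # re.escape() keeps the word literal even if it contains regex metacharacters.
            if re.search(re.escape(word), text):
                hits.append((category, word))
    return hits

print(sequential_search('tom loves lions tigers cats and bass', patterns_by_category))
```

With thousands of words and a long string, each of those calls rescans the text, which is the cost the automaton avoids.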