Efficiently querying one string against multiple regexes

感情败类 2020-12-12 17:16

Let's say that I have 10,000 regexes and one string, and I want to find out whether the string matches any of them and get all the matches. The trivial way to do it would be to jus

18 Answers
  •  北海茫月
    2020-12-12 17:52

    Aho-Corasick was the answer for me.

    I had 2000 categories of things, each with a list of patterns to match against. String length averaged about 100,000 characters.

    Main caveat: the patterns to match were all plain-language patterns (literal strings), not regex patterns, e.g. 'cat' vs r'\w+'.

    I was using Python, so I used https://pypi.python.org/pypi/pyahocorasick/.
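To make the caveat concrete, here is a small sketch (not part of the original answer) using only the standard library. The sample `text` is made up for illustration. It shows how the same three characters behave as a literal pattern versus as a regex.

```python
import re

# Hypothetical sample text that happens to contain the characters \w+ literally.
text = r"the pattern \w+ appears literally here"

# Literal matching: the three characters backslash, w, plus are searched as-is.
literal_hit = r"\w+" in text

# Regex matching: \w+ means "one or more word characters", so it matches "the".
regex_hit = re.search(r"\w+", text).group()

# re.escape() turns a literal into a regex that matches only that literal.
escaped_hit = re.search(re.escape(r"\w+"), text).group()

print(literal_hit, regex_hit, escaped_hit)
```

A literal matcher such as Aho-Corasick only ever does the first kind of search, so it cannot stand in for patterns that rely on regex metacharacters.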

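    import ahocorasick

    A = ahocorasick.Automaton()

    # Each row pairs a list of literal patterns with its category label.
    patterns = [
        [['cat', 'dog'], 'mammals'],
        [['bass', 'tuna', 'trout'], 'fish'],
        [['toad', 'crocodile'], 'amphibians'],
    ]

    # Register every pattern; the stored value is what A.iter() reports on a match.
    for vals, category in patterns:
        for val in vals:
            A.add_word(val, (category, val))

    # Finalize the automaton; this is required before searching.
    A.make_automaton()

    _string = 'tom loves lions tigers cats and bass'

    def test():
        # A.iter() yields an (end_index, value) tuple for each match, in a single pass.
        return list(A.iter(_string))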
    

    Running %timeit test() on my 2000 categories, with about 2-3 traces per category and a _string length of about 100,000, took 2.09 ms versus 631 ms for sequential re.search() calls: roughly 300x faster.
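For readers curious about what happens behind `A.make_automaton()` and `A.iter()`, below is a minimal pure-Python sketch of an Aho-Corasick automaton. It is not pyahocorasick's implementation, and the names `build_automaton` and `find_all` are hypothetical helpers invented for this illustration. Like the library, it reports each match as an `(end_index, payload)` pair.

```python
from collections import deque

def build_automaton(words):
    """Build a trie with failure links. `words` maps pattern -> payload."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # payloads of all patterns that end at each state
    for word, payload in words.items():
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(payload)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are computed.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Scan `text` once and return (end_index, payload) for every match."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists or we hit the root.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for payload in out[state]:
            hits.append((i, payload))
    return hits

# Same sample data as in the answer above.
words = {
    'cat': ('mammals', 'cat'), 'dog': ('mammals', 'dog'),
    'bass': ('fish', 'bass'), 'tuna': ('fish', 'tuna'), 'trout': ('fish', 'trout'),
    'toad': ('amphibians', 'toad'), 'crocodile': ('amphibians', 'crocodile'),
}
automaton = build_automaton(words)
hits = find_all('tom loves lions tigers cats and bass', automaton)
print(hits)  # [(25, ('mammals', 'cat')), (35, ('fish', 'bass'))]
```

The text is scanned once no matter how many patterns are registered, which is the reason this approach scales better than calling re.search() once per pattern.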
