Speed up millions of regex replacements in Python 3


I'm using Python 3.5.2.

I have two lists:

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would like to delete from my 750,000 sentences

    TLDR

    Use the trie-based method below if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.

    If you don't care about regex, use a set-based version (sketched below), which is 2000 times faster than a regex union.
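
    For reference, here's a minimal sketch of what such a set-based approach can look like (the \w+ tokenization and the sample data are assumptions for illustration; the actual set-based answer may differ in its details):

    import re

    banned_words = {"foobar", "foobah", "fooxar"}  # assumed sample data
    word_re = re.compile(r"\w+")  # tokenize on runs of word characters

    def clean(sentence):
        # Set membership is O(1), so each sentence costs a single pass.
        # Note: this simplification discards punctuation and original spacing.
        return " ".join(w for w in word_re.findall(sentence)
                        if w.lower() not in banned_words)

    print(clean("foobar says hello"))  # -> says hello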

    Optimized Regex with Trie

    A simple Regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

    It's possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and match.

    Example

    ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']
    

    The list is converted to a trie:

    {
        'f': {
            'o': {
                'o': {
                    'x': {
                        'a': {
                            'r': {
                                '': 1
                            }
                        }
                    },
                    'b': {
                        'a': {
                            'r': {
                                '': 1
                            },
                            'h': {
                                '': 1
                            }
                        }
                    },
                    'z': {
                        'a': {
                            '': 1,
                            'p': {
                                '': 1
                            }
                        }
                    }
                }
            }
        }
    }
    

    And then to this regex pattern:

    r"\bfoo(?:ba[hr]|xar|zap?)\b"
    

    The huge advantage is that to test whether zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying all 5 words. This preprocessing is overkill for 5 words, but it shows promising results for lists of many thousands of words.

    Note that (?:) non-capturing groups are used because:

    • foobar|baz would match foobar or baz, but not foobaz
    • foo(bar|baz) would save unneeded information to a capturing group.
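
    A quick demonstration of both points (a hypothetical two-word list, for illustration):

    import re

    text = "foobar foobaz baz"

    # Without a group, \b binds to each branch separately, so "baz" also
    # matches inside "foobaz":
    print(re.findall(r"\bfoobar|baz\b", text))      # ['foobar', 'baz', 'baz']

    # A non-capturing group applies the word boundaries to the whole
    # alternation, without saving a capture:
    print(re.findall(r"\b(?:foobar|baz)\b", text))  # ['foobar', 'baz']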

    Code

    Here's a slightly modified gist, which we can use as a trie.py library:

    import re


    class Trie:
        """Regex::Trie in Python. Creates a Trie out of a list of words.
        The trie can be exported to a regex pattern, which should match
        much faster than a simple regex union."""

        def __init__(self):
            self.data = {}

        def add(self, word):
            # Walk down the trie, creating nodes as needed. An empty-string
            # key marks the end of a word.
            ref = self.data
            for char in word:
                ref = ref.setdefault(char, {})
            ref[''] = 1

        def dump(self):
            return self.data

        def quote(self, char):
            return re.escape(char)

        def _pattern(self, data):
            # A node holding only the end-of-word marker contributes nothing.
            if "" in data and len(data) == 1:
                return None

            alt = []   # multi-character alternatives, e.g. "ba[hr]"
            cc = []    # single characters, mergeable into one character class
            q = False  # does a word end at this node?
            for char in sorted(data.keys()):
                if isinstance(data[char], dict):
                    recurse = self._pattern(data[char])
                    if recurse is None:
                        # The subtree is a single final character.
                        cc.append(self.quote(char))
                    else:
                        alt.append(self.quote(char) + recurse)
                else:
                    q = True
            cconly = not alt

            if cc:
                if len(cc) == 1:
                    alt.append(cc[0])
                else:
                    alt.append('[' + ''.join(cc) + ']')

            if len(alt) == 1:
                result = alt[0]
            else:
                result = "(?:" + "|".join(alt) + ")"

            if q:
                # A word may also stop here, so the suffix is optional.
                if cconly:
                    result += "?"
                else:
                    result = "(?:%s)?" % result
            return result

        def pattern(self):
            return self._pattern(self.dump())
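
    For instance, feeding the five-word example from above through the class reproduces the pattern shown earlier (minus the \b anchors, which the caller adds when compiling):

    from trie import Trie

    trie = Trie()
    for word in ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']:
        trie.add(word)

    print(trie.pattern())  # foo(?:ba[hr]|xar|zap?)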
    

    Test

    Here's a small test (the same one used to benchmark the simple regex union):

    # Encoding: utf-8
    import re
    import timeit
    import random
    from trie import Trie

    # Requires a Unix wordlist; adjust the path for your system.
    with open('/usr/share/dict/american-english') as wordbook:
        banned_words = [word.strip().lower() for word in wordbook]
        random.shuffle(banned_words)

    test_words = [
        ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
        ("First word", banned_words[0]),
        ("Last word", banned_words[-1]),
        ("Almost a word", "couldbeaword")
    ]

    def trie_regex_from_words(words):
        trie = Trie()
        for word in words:
            trie.add(word)
        return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

    def find(word):
        # timeit needs a zero-argument callable; close over the current union.
        def fun():
            return union.match(word)
        return fun

    for exp in range(1, 6):
        print("\nTrieRegex of %d words" % 10**exp)
        union = trie_regex_from_words(banned_words[:10**exp])
        for description, test_word in test_words:
            time = timeit.timeit(find(test_word), number=1000) * 1000
            print("  %s : %.1fms" % (description, time))
    

    It outputs:

    TrieRegex of 10 words
      Surely not a word : 0.3ms
      First word : 0.4ms
      Last word : 0.5ms
      Almost a word : 0.5ms
    
    TrieRegex of 100 words
      Surely not a word : 0.3ms
      First word : 0.5ms
      Last word : 0.9ms
      Almost a word : 0.6ms
    
    TrieRegex of 1000 words
      Surely not a word : 0.3ms
      First word : 0.7ms
      Last word : 0.9ms
      Almost a word : 1.1ms
    
    TrieRegex of 10000 words
      Surely not a word : 0.1ms
      First word : 1.0ms
      Last word : 1.2ms
      Almost a word : 1.2ms
    
    TrieRegex of 100000 words
      Surely not a word : 0.3ms
      First word : 1.2ms
      Last word : 0.9ms
      Almost a word : 1.6ms
    

    For info, the regex begins like this:

    (?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti...

    It's really unreadable, but for a list of 100,000 banned words, this Trie regex is 1000 times faster than a simple regex union!
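
    For comparison, the "simple regex union" baseline referred to throughout is essentially the following (a sketch of the idea; the exact code benchmarked against may differ):

    import re

    def union_regex_from_words(words):
        # One big alternation of all escaped words: correct, but the engine
        # must try the alternatives at each position, which is what makes it
        # slow for tens of thousands of words.
        return re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, words)),
                          re.IGNORECASE)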

    A diagram of the complete trie can be exported with trie-python-graphviz and graphviz twopi.
