Speed up millions of regex replacements in Python 3


I'm using Python 3.5.2

I have two lists:

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would like to delete from those 750,000 sentences
9 Answers

    TLDR

    Use the method below (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.

    If you insist on using a regex for lookup, use a trie-based pattern (sketched at the end of this answer), which is still 1000 times faster than a regex union.

    Theory

    If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.

    If you save all the banned words into a set, it will be very fast to check if another word is included in that set.

    Pack the logic into a function, pass that function as the replacement argument to re.sub, and you're done!

    Code

    import re
    with open('/usr/share/dict/american-english') as wordbook:
        banned_words = set(word.strip().lower() for word in wordbook)
    
    
    def delete_banned_words(matchobj):
        word = matchobj.group(0)
        if word.lower() in banned_words:
            return ""
        else:
            return word
    
    sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
                 "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000
    
    word_pattern = re.compile(r'\w+')
    
    converted_sentences = []
    for sentence in sentences:
        # re.sub returns a new string; collect the converted sentences.
        converted_sentences.append(word_pattern.sub(delete_banned_words, sentence))
    

    Converted sentences are:

    ' .  !
      .
    GiraffeElephantBoat
    sfgsdg sdwerha aswertwe
    

    Note that:

    • the search is case-insensitive (thanks to lower())
    • replacing a word with "" might leave two spaces (as in your code); see the optional cleanup sketch below
    • with Python 3, \w+ also matches accented characters (e.g. "ångström")
    • any non-word character (tab, space, newline, marks, ...) will stay untouched
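
    If the leftover double spaces are a problem for your use case, one optional follow-up pass (an extra step, not part of the code above) is to collapse runs of spaces after the substitution:

    import re
    
    space_run = re.compile(r' {2,}')
    
    def tidy(sentence):
        # Collapse the runs of spaces produced by deleting words, then trim the ends.
        return space_run.sub(' ', sentence).strip()
    
    print(tidy("' .  !"))  # "' . !"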

    Performance

    There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.

    In comparison, Liteye's answer needed 160s for 10 thousand sentences.

    With n being the total number of words and m the number of banned words, the OP's and Liteye's code are O(n*m).

    In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

    Regex union test

    What's the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

    It's pretty hard to grasp the way the regex engine works, so let's write a simple test.

    This code extracts 10**i random English words into a list. It creates the corresponding regex union, and tests it with different words:

    • one is clearly not a word (it begins with #)
    • one is the first word in the list
    • one is the last word in the list
    • one looks like a word but isn't


    import re
    import timeit
    import random
    
    with open('/usr/share/dict/american-english') as wordbook:
        english_words = [word.strip().lower() for word in wordbook]
        random.shuffle(english_words)
    
    print("First 10 words :")
    print(english_words[:10])
    
    test_words = [
        ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
        ("First word", english_words[0]),
        ("Last word", english_words[-1]),
        ("Almost a word", "couldbeaword")
    ]
    
    
    def find(word):
        def fun():
            return union.match(word)
        return fun
    
    for exp in range(1, 6):
        print("\nUnion of %d words" % 10**exp)
        union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
        for description, test_word in test_words:
            time = timeit.timeit(find(test_word), number=1000) * 1000
            print("  %-17s : %.1fms" % (description, time))
    

    It outputs:

    First 10 words :
    ["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']
    
    Union of 10 words
      Surely not a word : 0.7ms
      First word        : 0.8ms
      Last word         : 0.7ms
      Almost a word     : 0.7ms
    
    Union of 100 words
      Surely not a word : 0.7ms
      First word        : 1.1ms
      Last word         : 1.2ms
      Almost a word     : 1.2ms
    
    Union of 1000 words
      Surely not a word : 0.7ms
      First word        : 0.8ms
      Last word         : 9.6ms
      Almost a word     : 10.1ms
    
    Union of 10000 words
      Surely not a word : 1.4ms
      First word        : 1.8ms
      Last word         : 96.3ms
      Almost a word     : 116.6ms
    
    Union of 100000 words
      Surely not a word : 0.7ms
      First word        : 0.8ms
      Last word         : 1227.1ms
      Almost a word     : 1404.1ms
    

    So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

    • O(1) best case
    • O(n/2) average case, which is still O(n)
    • O(n) worst case

    These results are consistent with a simple loop search.
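
    For reference, a "simple loop search" here means scanning the banned words one by one, which is exactly the linear behaviour measured above; the set lookup used in the solution at the top avoids it. A minimal illustration (the names and data below are made up for the example):

    banned_list = ['first', 'second', 'third']  # illustrative data
    
    def loop_search(word):
        # Linear scan: the cost grows with the number of banned words,
        # which matches the regex-union timings above.
        for banned in banned_list:
            if word == banned:
                return True
        return False
    
    # A set membership test, by contrast, is a single hash lookup on average.
    banned_set = set(banned_list)
    print(loop_search('third'), 'third' in banned_set)  # True True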

    A much faster alternative to a regex union is to create the regex pattern from a trie.
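
    Here is a minimal sketch of that idea (an illustration of the technique, not the exact code from the trie-based answer): build a nested-dict trie from the banned words, then serialize it into a single pattern in which common prefixes are shared instead of being listed as separate alternatives.

    import re
    
    def trie_regex(words):
        # Build a nested-dict trie, then serialize it into one regex pattern
        # where common prefixes are shared instead of repeated.
        trie = {}
        for word in words:
            node = trie
            for char in word:
                node = node.setdefault(char, {})
            node[''] = True  # marks the end of a word
    
        def to_pattern(node):
            end = '' in node
            branches = [re.escape(char) + to_pattern(child)
                        for char, child in sorted(node.items()) if char != '']
            if not branches:
                return ''
            if len(branches) == 1 and not end:
                return branches[0]
            pattern = '(?:' + '|'.join(branches) + ')'
            return pattern + '?' if end else pattern
    
        return re.compile(r'\b' + to_pattern(trie) + r'\b')
    
    # 'foo' and 'foobar' share their common prefix in the generated pattern:
    pattern = trie_regex(['foo', 'foobar', 'baz'])
    print(pattern.pattern)                        # \b(?:baz|foo(?:bar)?)\b
    print(pattern.sub('', 'foo foobar qux baz'))  # '  qux '

    The point of this construction is that the regex engine follows shared prefixes instead of trying tens of thousands of alternatives one by one, which is why it scales so much better than a plain union.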
