Speed up millions of regex replacements in Python 3

醉酒成梦 2020-11-22 05:44

I'm using Python 3.5.2.

I have two lists

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would l
9 Answers
  •  时光取名叫无心
    2020-11-22 06:13

    Well, here's a quick and easy solution, with a test set.

    Winning strategy:

    re.sub(r"\w+", repl, sentence) finds every word in the sentence and passes each match to repl.

    "repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.

    This is the simplest and fastest solution (see function replace4 in example code below).
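
    As a minimal sketch of this strategy (the names replacements and replace_words are made up for illustration and are not part of the benchmark code), the callable looks each matched word up in a dict and falls back to the word itself when there is no entry:

```python
import re

# Hypothetical replacement table, for illustration only.
replacements = {"cat": "dog", "red": "blue"}

def replace_words(sentence, table):
    """Replace every whole word found in `table`; leave other words untouched."""
    lookup = table.get

    def repl(match):
        word = match.group()
        # Fall back to the original word when it is not in the table.
        return lookup(word, word)

    # \w+ matches whole runs of word characters, so "redcat" is one word.
    return re.sub(r"\w+", repl, sentence)

result = replace_words("The red cat sat on the redcat mat.", replacements)
print(result)
```

    Because \w+ always consumes a complete word, partial matches such as the "red" inside "redcat" are never replaced, so no \b anchors are needed.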

    Second best

    The idea is to split the sentences into words using re.split, keeping the separators so the sentences can be reconstructed afterwards. The replacements are then done with a simple dict lookup on each piece.

    (see function replace3 in example code below).
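
    A small sketch of the split-and-join approach, again with made-up names (replacements, replace_by_split). The capturing group in the pattern is what makes re.split return the separators alongside the words:

```python
import re

# Hypothetical replacement table, for illustration only.
replacements = {"cat": "dog", "red": "blue"}

def replace_by_split(sentence, table):
    """Split on non-word runs, swap words via dict lookup, then rejoin."""
    lookup = table.get
    # The parentheses keep the separators in the resulting list.
    tokens = re.split(r"(\W+)", sentence)
    # Separators are not keys in the table, so they pass through unchanged.
    return "".join(lookup(tok, tok) for tok in tokens)

tokens = re.split(r"(\W+)", "red cat, big mat")
rebuilt = replace_by_split("red cat, big mat", replacements)
print(tokens)
print(rebuilt)
```

    This assumes no separator string happens to be a key in the table; with word-only keys that holds automatically.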

    Timings for example functions:

    replace1: 0.62 sentences/s
    replace2: 7.43 sentences/s
    replace3: 48498.03 sentences/s
    replace4: 61374.97 sentences/s (...and about 240,000 sentences/s with PyPy)
    

    ...and code:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    
    import time, random, re
    
    def replace1( sentences ):
        # Naive: builds one regex per pattern and applies each to every sentence.
        for n, sentence in enumerate( sentences ):
            for search, repl in patterns:
                sentence = re.sub( "\\b"+search+"\\b", repl, sentence )
    
    def replace2( sentences ):
        # Same loop, but the patterns are compiled once up front.
        for n, sentence in enumerate( sentences ):
            for search, repl in patterns_comp:
                sentence = search.sub( repl, sentence )
    
    def replace3( sentences ):
        pd = patterns_dict.get
        for n, sentence in enumerate( sentences ):
            #~ print( n, sentence )
            # Split the sentence on non-word characters.
            # Note: () in split patterns ensure the non-word characters ARE kept
            # and returned in the result list, so we don't mangle the sentence.
            # If ALL separators are spaces, use string.split instead or something.
            # Example:
            #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
            #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
            words = re.split(r"([^\w]+)", sentence)
    
            # and... done.
            sentence = "".join( pd(w,w) for w in words )
    
            #~ print( n, sentence )
    
    def replace4( sentences ):
        pd = patterns_dict.get
        def repl(m):
            w = m.group()
            return pd(w,w)
    
        for n, sentence in enumerate( sentences ):
            sentence = re.sub(r"\w+", repl, sentence)
    
    
    
    # Build test set
    test_words = [ ("word%d" % _) for _ in range(50000) ]
    test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]
    
    # Create search and replace patterns
    patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
    patterns_dict = dict( patterns )
    patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]
    
    
    def test( func, num ):
        # Time func on the first num sentences and report throughput.
        t = time.perf_counter()
        func( test_sentences[:num] )
        print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.perf_counter()-t)))
    
    print( "Sentences", len(test_sentences) )
    print( "Words    ", len(test_words) )
    
    test( replace1, 1 )
    test( replace2, 10 )
    test( replace3, 1000 )
    test( replace4, 1000 )
    

    Edit: You can also make the lookup case-insensitive if the keys of patterns_dict are all lowercase and you edit repl to lowercase each word before looking it up:

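    def replace4( sentences ):
        pd = patterns_dict.get
        def repl(m):
            w = m.group()
            # Look up the lowercased word; keep the original word if there is no match.
            return pd(w.lower(),w)
    
        for n, sentence in enumerate( sentences ):
            sentence = re.sub(r"\w+", repl, sentence)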
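
    To illustrate the case-insensitive lookup in isolation, here is a short sketch with hypothetical names (replacements, replace_ignore_case). It assumes every key in the table is already lowercase:

```python
import re

# Hypothetical table; keys are assumed to be lowercase already.
replacements = {"cat": "dog"}

def replace_ignore_case(sentence, table):
    """Like the dict-lookup replacement, but matches words regardless of case."""
    lookup = table.get

    def repl(match):
        word = match.group()
        # Look up the lowercased form; keep the original word if absent.
        return lookup(word.lower(), word)

    return re.sub(r"\w+", repl, sentence)

result = replace_ignore_case("Cat and CAT and cat, not Cats", replacements)
print(result)
```

    Note that the replacement text comes straight from the table, so the capitalization of the original word is not carried over.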
    
