Is it worth using Python's re.compile?

Asked by 旧时难觅i, 2020-11-22 12:51

Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')

26 Answers

挽巷, 2020-11-22 13:14

    Here is an example where using re.compile is over 50 times faster, as requested.

    The point is the same as the one I made in the comment above: using re.compile can be a significant advantage when your usage pattern barely benefits from the module's compilation cache. This happens at least in one particular case (which I ran into in practice), namely when all of the following are true:

    • You have a lot of regex patterns (more than re._MAXCACHE, whose default is currently 512), and
    • you use these regexes a lot of times, and
    • your consecutive usages of the same pattern are separated by more than re._MAXCACHE other regexes, so that each one gets flushed from the cache between consecutive usages.
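
    Before the benchmark itself, here is a minimal sketch of the cache behaviour the list above describes. It peeks at CPython's private internals (re._cache, re._MAXCACHE, and the cache key layout are underscore-prefixed implementation details that differ slightly across versions; on Python 3.7 the cache is cleared wholesale when full, while newer versions evict entries one at a time), so treat it as an illustration only:

    import re

    re.purge()                          # public API: empty the internal pattern cache
    re.search('spam', 'spam and eggs')  # implicit compilation goes through the cache
    print(len(re._cache))               # 1: the compiled pattern was stored
    print(re._MAXCACHE)                 # 512 by default

    # Compile more than re._MAXCACHE distinct patterns, overflowing the cache.
    for i in range(re._MAXCACHE + 1):
        re.search('x{%d}' % i, 'xxxx')

    # However the cache is managed, 'spam' has been evicted by now and would have
    # to be recompiled on its next use -- exactly the situation described above.
    print((str, 'spam', 0) in re._cache)  # False

    The actual benchmark:
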
    import re
    import time
    
    def setup(N=1000):
        # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
        patterns = [chr(i) + '.*' + chr(j)
                        for i in range(ord('a'), ord('z') + 1)
                        for j in range(ord('a'), ord('z') + 1)]
        # If the assertion below fails, just add more (distinct) patterns.
        # assert re._MAXCACHE < len(patterns)
        # N strings. Increase N for larger effect.
        strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
        return (patterns, strings)
    
    def without_compile():
        print('Without re.compile:')
        patterns, strings = setup()
        print('searching')
        count = 0
        for s in strings:
            for pat in patterns:
                count += bool(re.search(pat, s))
        return count
    
    def without_compile_cache_friendly():
        print('Without re.compile, cache-friendly order:')
        patterns, strings = setup()
        print('searching')
        count = 0
        for pat in patterns:
            for s in strings:
                count += bool(re.search(pat, s))
        return count
    
    def with_compile():
        print('With re.compile:')
        patterns, strings = setup()
        print('compiling')
        compiled = [re.compile(pattern) for pattern in patterns]
        print('searching')
        count = 0
        for s in strings:
            for regex in compiled:
                count += bool(regex.search(s))
        return count
    
    start = time.time()
    print(with_compile())
    d1 = time.time() - start
    print(f'-- That took {d1:.2f} seconds.\n')
    
    start = time.time()
    print(without_compile_cache_friendly())
    d2 = time.time() - start
    print(f'-- That took {d2:.2f} seconds.\n')
    
    start = time.time()
    print(without_compile())
    d3 = time.time() - start
    print(f'-- That took {d3:.2f} seconds.\n')
    
    print(f'Ratio: {d3/d1:.2f}')
    

    Example output I get on my laptop (Python 3.7.7):

    With re.compile:
    compiling
    searching
    676000
    -- That took 0.33 seconds.
    
    Without re.compile, cache-friendly order:
    searching
    676000
    -- That took 0.67 seconds.
    
    Without re.compile:
    searching
    676000
    -- That took 23.54 seconds.
    
    Ratio: 70.89
    

    I didn't bother with timeit since the difference is so stark, but I get qualitatively similar numbers on each run. Note that even without re.compile, using the same regex many times before moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (cycling through many regexes for each string) it is significantly worse, as expected. Increasing the cache size also works: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production, as underscore-prefixed names are conventionally "private") drops the ~23 seconds back down to ~0.7 seconds, which again matches our understanding.
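
    For concreteness, here is roughly what that tweak could look like; re._MAXCACHE is a private CPython name, so this is a sketch for experiments only, not for production code:

    import re

    def setup(N=1000):
        # Patterns 'a.*a', 'a.*b', ..., 'z.*z' (676 of them, as above)
        patterns = [chr(i) + '.*' + chr(j)
                        for i in range(ord('a'), ord('z') + 1)
                        for j in range(ord('a'), ord('z') + 1)]
        # Experiment only: grow the private cache so that every pattern fits,
        # then start from an empty cache so the new limit applies cleanly.
        re._MAXCACHE = len(patterns)
        re.purge()
        strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
        return (patterns, strings)

    With this change, even the cache-unfriendly without_compile() loop runs in roughly the same time as the cache-friendly order, because no pattern is ever flushed.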
