Is there any benefit in using compile for regular expressions in Python?
h = re.compile('hello')
h.match('hello world')
vs
re.match('hello', 'hello world')
Here is an example where using re.compile is over 50 times faster, as requested.

The point is just the same as the one I made in the comment above, namely that using re.compile can be a significant advantage when your usage pattern is such that it doesn't benefit much from the compilation cache. This happens at least in one particular case (that I ran into in practice), namely when all of the following are true:

- You have a lot of regex patterns (more than re._MAXCACHE, whose default is currently 512), and
- you use these regexes a lot of times, and
- your consecutive usages of the same pattern are separated by more than re._MAXCACHE other regexes in between, so that each one gets flushed from the cache between consecutive usages.
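As a quick illustration of the cache in question, here is a minimal sketch. It assumes CPython, where the re module keeps compiled patterns in the private dict re._cache, bounded by re._MAXCACHE; both names are implementation details and may change between versions.

import re

re.purge()  # clear the module-level pattern cache (this one is a public API)
print(re._MAXCACHE)    # the cache bound; 512 by default in recent versions
re.search('a.*b', 'xxabyy')  # compilation happens implicitly and is cached
re.search('c.*d', 'xxcdyy')
print(len(re._cache))  # 2 on CPython: both patterns are now cached

Once more distinct patterns than re._MAXCACHE are in play, entries get evicted (the exact eviction policy varies across Python versions), which is exactly what the benchmark below provokes: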
import re
import time


def setup(N=1000):
    # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
    patterns = [chr(i) + '.*' + chr(j)
                for i in range(ord('a'), ord('z') + 1)
                for j in range(ord('a'), ord('z') + 1)]
    # If this assertion below fails, just add more (distinct) patterns.
    assert re._MAXCACHE < len(patterns)
    # N strings. Increase N for larger effect.
    strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
    return (patterns, strings)


def without_compile():
    print('Without re.compile:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for s in strings:
        for pat in patterns:
            count += bool(re.search(pat, s))
    return count


def without_compile_cache_friendly():
    print('Without re.compile, cache-friendly order:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for pat in patterns:
        for s in strings:
            count += bool(re.search(pat, s))
    return count


def with_compile():
    print('With re.compile:')
    patterns, strings = setup()
    print('compiling')
    compiled = [re.compile(pattern) for pattern in patterns]
    print('searching')
    count = 0
    for s in strings:
        for regex in compiled:
            count += bool(regex.search(s))
    return count


start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')

start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')

start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')

print(f'Ratio: {d3/d1:.2f}')
Example output I get on my laptop (Python 3.7.7):
With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.
Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.
Without re.compile:
searching
676000
-- That took 23.54 seconds.
Ratio: 70.89
I didn't bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times and then moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected. Also, increasing the cache size works too: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production, as names with underscores are conventionally “private”) drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.
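For concreteness, that tweak is a one-line change; here is a sketch of the modified setup() (again, re._MAXCACHE is a private name, so treat this as an experiment, not something for production code):

import re


def setup(N=1000):
    # Same patterns as before: 'a.*a', 'a.*b', ..., 'z.*z'
    patterns = [chr(i) + '.*' + chr(j)
                for i in range(ord('a'), ord('z') + 1)
                for j in range(ord('a'), ord('z') + 1)]
    # Enlarge the private cache so no pattern ever gets evicted.
    # (Implementation detail; don't do this in production.)
    re._MAXCACHE = len(patterns)
    strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
    return (patterns, strings)

With this change, every lookup in without_compile() becomes a cache hit, which is why the time drops back to roughly the cache-friendly figure.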