Is there any benefit in using compile for regular expressions in Python?
h = re.compile(\'hello\')
h.match(\'hello world\')
vs
I agree with Honest Abe that the match(...) in the given examples are different. They are not a one-to-one comparisons and thus, outcomes are vary. To simplify my reply, I use A, B, C, D for those functions in question. Oh yes, we are dealing with 4 functions in re.py instead of 3.
Running this piece of code:
h = re.compile('hello') # (A)
h.match('hello world') # (B)
is same as running this code:
re.match('hello', 'hello world') # (C)
Because, when looked into the source re.py, (A + B) means:
h = re._compile('hello') # (D)
h.match('hello world')
and (C) is actually:
re._compile('hello').match('hello world')
So, (C) is not the same as (B). In fact, (C) calls (B) after calling (D) which is also called by (A). In other words, (C) = (A) + (B). Therefore, comparing (A + B) inside a loop has same result as (C) inside a loop.
George's regexTest.py proved this for us.
noncompiled took 4.555 seconds. # (C) in a loop
compiledInLoop took 4.620 seconds. # (A + B) in a loop
compiled took 2.323 seconds. # (A) once + (B) in a loop
Everyone's interest is, how to get the result of 2.323 seconds. In order to make sure compile(...) only get called once, we need to store the compiled regex object in memory. If we are using a class, we could store the object and reuse when every time our function get called.
class Foo:
regex = re.compile('hello')
def my_function(text)
return regex.match(text)
If we are not using class (which is my request today), then I have no comment. I'm still learning to use global variable in Python, and I know global variable is a bad thing.
One more point, I believe that using (A) + (B) approach has an upper hand. Here are some facts as I observed (please correct me if I'm wrong):
Calls A once, it will do one search in the _cache followed by one sre_compile.compile() to create a regex object. Calls A twice, it will do two searches and one compile (because the regex object is cached).
If the _cache get flushed in between, then the regex object is released from memory and Python need to compile again. (someone suggest that Python won't recompile.)
If we keep the regex object by using (A), the regex object will still get into _cache and get flushed somehow. But our code keep a reference on it and the regex object will not be released from memory. Those, Python need not to compile again.
The 2 seconds differences in George's test compiledInLoop vs compiled is mainly the time required to build the key and search the _cache. It doesn't mean the compile time of regex.
George's reallycompile test show what happen if it really re-do the compile every time: it will be 100x slower (he reduced the loop from 1,000,000 to 10,000).
Here are the only cases that (A + B) is better than (C):
Case that (C) is good enough:
Just a recap, here are the A B C:
h = re.compile('hello') # (A)
h.match('hello world') # (B)
re.match('hello', 'hello world') # (C)
Thanks for reading.