Compiling Regular Expressions in Python

Posted by 萝らか妹 on 2019-12-05 15:59:40
Tim Pietzcker

Hm. This is strange. My knowledge so far (gained, among other sources, from this question) suggested my initial answer:


First answer

Python caches the last 100 regexes that you used, so even if you don't compile them explicitly, they don't have to be recompiled at every use.

However, there are two drawbacks: When the limit of 100 regexes is reached, the entire cache is nuked, so if you use 101 different regexes in a row, each one will be recompiled every time. Well, that's rather unlikely, but still.

Second, to find out whether a regex has already been compiled, the interpreter has to look it up in the cache on every call, which takes a little extra time (but not much, since dictionary lookups are very fast).

So, if you explicitly compile your regexes, you avoid this extra lookup step.
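
One rough way to observe the cache, as a sketch rather than a guarantee: in CPython, compiling the same pattern twice through re.compile() normally hands back the very same pattern object, and re.purge() empties the cache. The identity check relies on an implementation detail, and the helper name is made up for illustration.

```python
import re

def compiled_twice_is_same_object(pattern):
    """Return True if two re.compile() calls yield the same cached object.

    Assumption: this relies on CPython's internal regex cache, which is an
    implementation detail and not a documented guarantee.
    """
    re.purge()                    # start from an empty cache
    first = re.compile(pattern)   # compiled and stored in the cache
    second = re.compile(pattern)  # expected to be served from the cache
    return first is second

print(compiled_twice_is_same_object(r"\w+"))
```

If the function returns True, the second call did not build a new pattern object, which is the behaviour the cache description above predicts.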


Update

I just did some testing (Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")''')
18.547793477671938
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+","  jkdhf  ")''')
106.47892003890324

So it would appear that no caching is being done. Perhaps that's a quirk of the special conditions under which timeit.timeit() runs?

On the other hand, in Python 2.7, the difference is not as noticeable:

>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")''')
7.248294908492429
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+","  jkdhf  ")''')
18.26713670282241
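
For anyone who wants to repeat the comparison, here is a minimal sketch of the same measurement as a script. Note that timeit.timeit() runs its statement 1,000,000 times by default, so the version below passes a smaller number to finish quickly. The function names and the default of 20000 runs are arbitrary choices for illustration, and the timings will differ between machines and Python versions.

```python
import re
import timeit

TEXT = "  jkdhf  "
PATTERN = r"\w+"

def time_precompiled(number=20000):
    """Time searches through a pattern object compiled once up front."""
    compiled = re.compile(PATTERN)
    return timeit.timeit(lambda: compiled.search(TEXT), number=number)

def time_module_level(number=20000):
    """Time searches through re.search(), which resolves the pattern on each call."""
    return timeit.timeit(lambda: re.search(PATTERN, TEXT), number=number)

if __name__ == "__main__":
    print("precompiled :", time_precompiled())
    print("module-level:", time_module_level())
```

Both functions return the total elapsed time in seconds for the given number of calls, so the two figures can be compared directly.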

I believe what he is trying to say is that you shouldn't compile your regex inside your loop, but outside it. You can then just run the already compiled code inside the loop.

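Instead of:

import re
while True:
    # the pattern is looked up (and possibly compiled) on every pass
    result = re.match('A', text)

You should put:

import re
regex = re.compile('A')  # compile once, outside the loop
while True:
    result = regex.match(text)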

Basically, re.match(pattern, string) combines the compilation and matching steps. Compiling the same pattern inside the loop is inefficient, so the compilation should be hoisted out of the loop.

See Tim's answer for the correct reasoning.

It sounds to me like the author is simply saying that it's more efficient to compile a regex and save it yourself than to count on a previously compiled version still being held in the module's limited-size internal cache. This is probably because the effort of recompiling a pattern that has dropped out of the cache, plus the cache lookup that must happen first on every call, costs more than the client simply storing the compiled pattern itself.
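
As a minimal sketch of "the client storing it itself": compile the pattern once at module level and reuse the resulting object wherever it is needed. The names WORD_RE and first_word are hypothetical, chosen only for this example.

```python
import re

# Compiled once at import time and kept by the caller, so it does not
# depend on the re module's internal cache. WORD_RE is a hypothetical name.
WORD_RE = re.compile(r"\w+")

def first_word(text):
    """Return the first run of word characters in text, or None."""
    match = WORD_RE.search(text)
    return match.group(0) if match else None

print(first_word("  jkdhf  "))  # prints jkdhf
```

Every call to first_word() goes straight to the stored pattern object, with no per-call lookup by pattern string.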
