Is it worth using Python's re.compile?

旧时难觅i 2020-11-22 12:51

Is there any benefit in using compile for regular expressions in Python?

import re
h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')

26 answers
  •  没有蜡笔的小新
    2020-11-22 13:06

    I just tried this myself. For the simple case of parsing a number out of a string and summing it, using a compiled regular expression object is about twice as fast as using the re methods.

    As others have pointed out, the re methods (including re.compile) look up the regular expression string in a cache of previously compiled expressions. Therefore, in the normal case, the extra cost of using the re methods is simply the cost of the cache lookup.
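
A quick way to see the cache at work is sketched below. It assumes Python 3 on CPython, and the identity check leans on an implementation detail rather than a documented guarantee. The variable names are only illustrative; `re.purge()` is the public call that empties the cache.

```python
import re

pattern = r'\w+\s+([0-9_]+)\s+\w*'

first = re.compile(pattern)
second = re.compile(pattern)       # second call is answered from the cache
same_object = first is second      # identical object in CPython (implementation detail)

re.purge()                         # public API: clear the regular expression cache
third = re.compile(pattern)        # the pattern has to be compiled again
recompiled = third is not first    # a fresh object, because the cache was empty

print(same_object, recompiled)
```

If both values are true, repeated `re.compile` calls on the same string cost only a lookup until the cache is cleared or overflows.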

    However, examination of the code shows that the cache is limited to 100 expressions. This raises the question: how painful is it to overflow the cache? The code contains an internal interface to the regular expression compiler, re.sre_compile.compile. If we call it, we bypass the cache. Compiling this way on every iteration turns out to be about two orders of magnitude slower for a basic regular expression, such as r'\w+\s+([0-9_]+)\s+\w*'.
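
The cost of a cache miss can also be estimated without touching the internal interface. The following is a rough sketch for Python 3, assuming that clearing the cache with `re.purge()` before each call is a fair stand-in for overflowing it. The function names and the iteration count are made up for illustration, and the absolute timings will vary by machine and Python version.

```python
import re
import timeit

PATTERN = r'\w+\s+([0-9_]+)\s+\w*'
TEXT = "average    2 never"

def cached_match():
    # Normal path: re.match finds the compiled pattern in the cache.
    return re.match(PATTERN, TEXT).group(1)

def uncached_match():
    # Emptying the cache first forces a full recompilation on every call.
    re.purge()
    return re.match(PATTERN, TEXT).group(1)

# Assumption: 2000 calls is enough to show the gap without a long wait.
cached_time = timeit.timeit(cached_match, number=2000)
uncached_time = timeit.timeit(uncached_match, number=2000)

print('cached:   %.4f s' % cached_time)
print('uncached: %.4f s' % uncached_time)
```

Using `re.purge()` keeps the experiment on documented API, at the price of also measuring the purge itself.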

    Here's my test:

    #!/usr/bin/env python
    # Python 2 script: it relies on print statements, xrange and func.func_name.
    import re
    import time
    
    def timed(func):
        def wrapper(*args):
            t = time.time()
            result = func(*args)
            t = time.time() - t
            print '%s took %.3f seconds.' % (func.func_name, t)
            return result
        return wrapper
    
    regularExpression = r'\w+\s+([0-9_]+)\s+\w*'
    testString = "average    2 never"
    
    @timed
    def noncompiled():
        a = 0
        for x in xrange(1000000):
            m = re.match(regularExpression, testString)
            a += int(m.group(1))
        return a
    
    @timed
    def compiled():
        a = 0
        rgx = re.compile(regularExpression)
        for x in xrange(1000000):
            m = rgx.match(testString)
            a += int(m.group(1))
        return a
    
    @timed
    def reallyCompiled():
        a = 0
        rgx = re.sre_compile.compile(regularExpression)
        for x in xrange(1000000):
            m = rgx.match(testString)
            a += int(m.group(1))
        return a
    
    
    @timed
    def compiledInLoop():
        a = 0
        for x in xrange(1000000):
            rgx = re.compile(regularExpression)
            m = rgx.match(testString)
            a += int(m.group(1))
        return a
    
    @timed
    def reallyCompiledInLoop():
        a = 0
        for x in xrange(10000):
            rgx = re.sre_compile.compile(regularExpression)
            m = rgx.match(testString)
            a += int(m.group(1))
        return a
    
    r1 = noncompiled()
    r2 = compiled()
    r3 = reallyCompiled()
    r4 = compiledInLoop()
    r5 = reallyCompiledInLoop()
    print "r1 = ", r1
    print "r2 = ", r2
    print "r3 = ", r3
    print "r4 = ", r4
    print "r5 = ", r5
    
    And here is the output on my machine:
    $ regexTest.py 
    noncompiled took 4.555 seconds.
    compiled took 2.323 seconds.
    reallyCompiled took 2.325 seconds.
    compiledInLoop took 4.620 seconds.
    reallyCompiledInLoop took 4.074 seconds.
    r1 =  2000000
    r2 =  2000000
    r3 =  2000000
    r4 =  2000000
    r5 =  20000
    

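    The 'reallyCompiled' functions use the internal interface, which bypasses the cache. Note that reallyCompiledInLoop, which compiles on every loop iteration, runs only 10,000 iterations, not one million, so its 4.074 seconds works out to roughly 90 times the per-iteration cost of compiledInLoop.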
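
For anyone repeating the first two cases on Python 3, a minimal port might look like the sketch below. It is an adaptation, not the script that produced the numbers above: it swaps in `time.perf_counter`, `functools.wraps` and `range`, uses a smaller loop count (the `LOOPS` constant is an assumption), and leaves out the cases that depend on the internal compiler interface.

```python
import functools
import re
import time

def timed(func):
    """Print how long each call to func takes."""
    @functools.wraps(func)           # keep the wrapped function's name
    def wrapper(*args):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        print('%s took %.3f seconds.' % (func.__name__, elapsed))
        return result
    return wrapper

PATTERN = r'\w+\s+([0-9_]+)\s+\w*'
TEXT = "average    2 never"
LOOPS = 100000  # assumption: smaller than the original one million

@timed
def noncompiled():
    # Module-level function: one cache lookup per call.
    total = 0
    for _ in range(LOOPS):
        total += int(re.match(PATTERN, TEXT).group(1))
    return total

@timed
def compiled():
    # Pattern object created once, outside the loop.
    total = 0
    rgx = re.compile(PATTERN)
    for _ in range(LOOPS):
        total += int(rgx.match(TEXT).group(1))
    return total

r1 = noncompiled()
r2 = compiled()
print("r1 = ", r1)
print("r2 = ", r2)
```

Both functions should return the same sum; only the reported times are expected to differ.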
