Is it worth using Python's re.compile?

前端 未结 26 2137
旧时难觅i
旧时难觅i 2020-11-22 12:51

Is there any benefit in using compile for regular expressions in Python?

h = re.compile(\'hello\')
h.match(\'hello world\')

vs



        
26条回答
  •  旧巷少年郎
    2020-11-22 13:23

    Mostly, there is little difference whether you use re.compile or not. Internally, all of the functions are implemented in terms of a compile step:

    def match(pattern, string, flags=0):
        return _compile(pattern, flags).match(string)
    
    def fullmatch(pattern, string, flags=0):
        return _compile(pattern, flags).fullmatch(string)
    
    def search(pattern, string, flags=0):
        return _compile(pattern, flags).search(string)
    
    def sub(pattern, repl, string, count=0, flags=0):
        return _compile(pattern, flags).sub(repl, string, count)
    
    def subn(pattern, repl, string, count=0, flags=0):
        return _compile(pattern, flags).subn(repl, string, count)
    
    def split(pattern, string, maxsplit=0, flags=0):
        return _compile(pattern, flags).split(string, maxsplit)
    
    def findall(pattern, string, flags=0):
        return _compile(pattern, flags).findall(string)
    
    def finditer(pattern, string, flags=0):
        return _compile(pattern, flags).finditer(string)
    

    In addition, re.compile() bypasses the extra indirection and caching logic:

    _cache = {}
    
    _pattern_type = type(sre_compile.compile("", 0))
    
    _MAXCACHE = 512
    def _compile(pattern, flags):
        # internal: compile pattern
        try:
            p, loc = _cache[type(pattern), pattern, flags]
            if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
                return p
        except KeyError:
            pass
        if isinstance(pattern, _pattern_type):
            if flags:
                raise ValueError(
                    "cannot process flags argument with a compiled pattern")
            return pattern
        if not sre_compile.isstring(pattern):
            raise TypeError("first argument must be string or compiled pattern")
        p = sre_compile.compile(pattern, flags)
        if not (flags & DEBUG):
            if len(_cache) >= _MAXCACHE:
                _cache.clear()
            if p.flags & LOCALE:
                if not _locale:
                    return p
                loc = _locale.setlocale(_locale.LC_CTYPE)
            else:
                loc = None
            _cache[type(pattern), pattern, flags] = p, loc
        return p
    

    In addition to the small speed benefit from using re.compile, people also like the readability that comes from naming potentially complex pattern specifications and separating them from the business logic where there are applied:

    #### Patterns ############################################################
    number_pattern = re.compile(r'\d+(\.\d*)?')    # Integer or decimal number
    assign_pattern = re.compile(r':=')             # Assignment operator
    identifier_pattern = re.compile(r'[A-Za-z]+')  # Identifiers
    whitespace_pattern = re.compile(r'[\t ]+')     # Spaces and tabs
    
    #### Applications ########################################################
    
    if whitespace_pattern.match(s): business_logic_rule_1()
    if assign_pattern.match(s): business_logic_rule_2()
    

    Note, one other respondent incorrectly believed that pyc files stored compiled patterns directly; however, in reality they are rebuilt each time when the PYC is loaded:

    >>> from dis import dis
    >>> with open('tmp.pyc', 'rb') as f:
            f.read(8)
            dis(marshal.load(f))
    
      1           0 LOAD_CONST               0 (-1)
                  3 LOAD_CONST               1 (None)
                  6 IMPORT_NAME              0 (re)
                  9 STORE_NAME               0 (re)
    
      3          12 LOAD_NAME                0 (re)
                 15 LOAD_ATTR                1 (compile)
                 18 LOAD_CONST               2 ('[aeiou]{2,5}')
                 21 CALL_FUNCTION            1
                 24 STORE_NAME               2 (lc_vowels)
                 27 LOAD_CONST               1 (None)
                 30 RETURN_VALUE
    

    The above disassembly comes from the PYC file for a tmp.py containing:

    import re
    lc_vowels = re.compile(r'[aeiou]{2,5}')
    

提交回复
热议问题