Empty string instead of unmatched group error

萌比男神i 2020-11-29 11:21

I have this piece of code:

for n in range(1, 10):
    new = re.sub(r'(regex(group)regex)?regex', r'something'+str(n)+r'\1', old, count=1)


        
3 answers
  • 2020-11-29 12:06

    I looked at this again.
    It is unfortunate that you have to deal with NULLs at all,
    but here are the rules you must follow.
    (NULL means the group never took part in the match; EMPTY means it took part and matched a zero-length string.)

    Every pattern below successfully matches the empty string.
    Running tests like these is the only way to find out the rules.

    It's not as simple as you may think. Take a close look at the results.
    Nothing in the form of a pattern tells you at a glance whether a group will come back NULL or EMPTY.

    Looking closer, however, the rules do emerge, and they are fairly simple.
    You must follow them if you care about NULL.

    There are only two rules:

    Rule #1 - Any capture GROUP that can't be reached will result in NULL

       (?<Alt_1>                     # (1 start)
            (?<a> a )?                    # (2)
            (?<b> b? )                    # (3)
       )?                            # (1 end)
    |  
       (?<Alt_2>                     # (4 start)
            (?<c> c? )                    # (5)
            (?<d> d? )                    # (6)
       )                             # (4 end)
    
     **  Grp 0         -  ( pos 0 , len 0 )  EMPTY 
     **  Grp 1 [Alt_1] -  ( pos 0 , len 0 )  EMPTY 
     **  Grp 2 [a]     -  NULL 
     **  Grp 3 [b]     -  ( pos 0 , len 0 )  EMPTY 
     **  Grp 4 [Alt_2] -  NULL 
     **  Grp 5 [c]     -  NULL 
     **  Grp 6 [d]     -  NULL 
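
    The same experiment can be repeated in Python as a rough check. This is a minimal sketch, assuming Python's re module, whose named-group syntax is (?P<name>...) rather than the (?<name>...) shown above; the variable name rule1 is only illustrative. In Python, NULL shows up as None and EMPTY as ''. Other regex engines may not agree on every case.

```python
import re

# Rule #1 pattern, rewritten with Python's (?P<name>...) syntax.
rule1 = re.compile(r"""
    (?P<Alt_1>
        (?P<a> a )?
        (?P<b> b? )
    )?
  |
    (?P<Alt_2>
        (?P<c> c? )
        (?P<d> d? )
    )
""", re.VERBOSE)

m = rule1.match("")

# The second alternative is never reached, so its groups are None (NULL).
# Groups that took part but consumed nothing are '' (EMPTY).
for name, value in m.groupdict().items():
    print(name, repr(value))
```

    The first alternative can always match the empty string, so the engine never tries Alt_2, and everything inside it stays unset.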
    

    Rule #2 - Any capture GROUP that can't be matched on the INSIDE will result in NULL

     (?<A_1>                       # (1 start)
          (?<a1> a? )                   # (2)
     )?                            # (1 end)
     (?<A_2>                       # (3 start)
          (?<a2> a )?                   # (4)
     )?                            # (3 end)
     (?<A_3>                       # (5 start)
          (?<a3> a )                    # (6)
     )?                            # (5 end)
     (?<A_4>                       # (7 start)
          (?<a4> a )?                   # (8)
     )                             # (7 end)
    
    **  Grp 0       -  ( pos 0 , len 0 )  EMPTY 
    **  Grp 1 [A_1] -  ( pos 0 , len 0 )  EMPTY 
    **  Grp 2 [a1]  -  ( pos 0 , len 0 )  EMPTY 
    **  Grp 3 [A_2] -  ( pos 0 , len 0 )  EMPTY 
    **  Grp 4 [a2]  -  NULL 
    **  Grp 5 [A_3] -  NULL 
    **  Grp 6 [a3]  -  NULL 
    **  Grp 7 [A_4] -  ( pos 0 , len 0 )  EMPTY 
    **  Grp 8 [a4]  -  NULL 
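
    Rule #2 can be checked the same way. The sketch below assumes Python's re module and its (?P<name>...) group syntax; rule2 and label are hypothetical names introduced only for this illustration, and other engines may differ.

```python
import re

# Rule #2 pattern, rewritten with Python's (?P<name>...) syntax.
rule2 = re.compile(r"""
    (?P<A_1> (?P<a1> a? ) )?
    (?P<A_2> (?P<a2> a )? )?
    (?P<A_3> (?P<a3> a )  )?
    (?P<A_4> (?P<a4> a )? )
""", re.VERBOSE)

m = rule2.match("")

def label(value):
    """Translate Python's None / '' into the NULL / EMPTY labels used above."""
    if value is None:
        return "NULL"
    return "EMPTY" if value == "" else repr(value)

# Print each numbered group with its label.
for number, value in enumerate(m.groups(), start=1):
    print("Grp", number, "-", label(value))
```

    A group whose inside requires an 'a' stays NULL, while a group whose inside can match nothing comes back EMPTY.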
    
  • 2020-11-29 12:18

    To simplify:

    Problem

    1. You are getting the error "sre_constants.error: unmatched group" from a Python 2.7 regex.
    2. Your regex pattern has optional groups (with or without nested expressions), and you are trying to use those groups in the replacement argument of sub, i.e. re.sub(pattern, repl, string) or compiled.sub(repl, string).

    Solution:

    Return match.group(1) from a replacement function instead of writing \1 (or \2, \3, etc.) in a replacement string. That's it; no "or ''" fallback is needed. The group result(s) can be returned from a named function or a lambda.

    Example

    You are using a common regex to strip C-style comments. Its design uses an optional group 1 to pass through pseudo-comments (comment-like text inside double-quoted strings), which should not be deleted if they exist.

    import re

    pattern = r'//.*|/\*[\s\S]*?\*/|("(\\.|[^"])*")'
    regex = re.compile(pattern)
    

    Using \1 fails on Python 2.7 with the error "sre_constants.error: unmatched group":

    return regex.sub(r'\1', string)
    

    Using .group(1) succeeds:

    return regex.sub(lambda m: m.group(1), string)
    

    For those not familiar with lambda, this solution is equivalent to:

    def optgroup(match):
        return match.group(1)
    return regex.sub(optgroup, string)
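
    Put together, the fragments above might look like the following self-contained sketch. The names COMMENT_OR_STRING, strip_comments and sample are hypothetical, chosen only for this illustration. It relies on CPython's sub() inserting nothing when the replacement function returns None, which is the behavior this answer describes.

```python
import re

# Pattern from the answer: comments match without being captured,
# while double-quoted string literals are captured in group 1.
COMMENT_OR_STRING = re.compile(r'//.*|/\*[\s\S]*?\*/|("(\\.|[^"])*")')

def strip_comments(source):
    """Remove // and /* */ comments, leaving string literals untouched."""
    def keep_group_1(match):
        # None when a comment matched (sub() then inserts nothing),
        # the literal itself when a string matched.
        return match.group(1)
    return COMMENT_OR_STRING.sub(keep_group_1, source)

sample = 'int x = 1; // counter\nchar *s = "not // a comment"; /* block */\n'
cleaned = strip_comments(sample)
print(cleaned)
```

    If you prefer not to depend on how None is handled, returning match.group(1) or '' makes the empty replacement explicit and gives the same result.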
    

    See the accepted answer for an excellent discussion of why \1 fails due to Bug 1519638. While the accepted answer is authoritative, it has two shortcomings: 1) the example from the original question is so convoluted that the example solution is difficult to read, and 2) it suggests returning either the group or an empty string, which is not required; you may simply return .group() for each match.

  • 2020-11-29 12:21

    Root cause

    Before Python 3.5, backreferences to failed capture groups in Python re.sub were not populated with an empty string. The behavior is described in Bug 1519638 at bugs.python.org. Thus, using a backreference to a group that did not participate in the match resulted in an error.
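
    To see which behavior your interpreter has, a small probe along these lines may help. It is only a sketch; sub_or_error is a hypothetical helper name, and the pattern is a deliberately tiny stand-in for the real one.

```python
import re

def sub_or_error(pattern, repl, string):
    """Return the substituted string, or the error text if re.sub raises."""
    try:
        return re.sub(pattern, repl, string)
    except re.error as exc:
        return 'error: %s' % exc

# Group 1 does not participate because the subject contains no 'a'.
result = sub_or_error(r'(a)?b', r'[\1]', 'b')
print(result)
```

    On Python 3.5 and later this prints [], because the unmatched group is replaced with an empty string; on older versions the helper reports the unmatched group error instead.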

    There are two ways to fix that issue.

    Solution 1: Adding empty alternatives to make optional groups obligatory

    You can replace all optional capturing groups (those constructs like (\d+)?) with obligatory ones with an empty alternative (i.e. (\d+|)).

    Here is an example of the failure:

    import re
    old = 'regexregex'
    new = re.sub(r'regex(group)?regex', r'something\1something', old)
    print(new)
    

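    Replacing the re.sub line with the following makes it work, because group 1 now always participates in the match (with an empty string when 'group' is absent):

    new = re.sub(r'regex(group|)regex', r'something\1something', old)
    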
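
    The difference between the two patterns is visible on the match object itself. The following is a minimal sketch; the variable names optional and obligatory are introduced here only for illustration.

```python
import re

old = 'regexregex'  # same subject as in the example; 'group' is absent

optional = re.search(r'regex(group)?regex', old)
obligatory = re.search(r'regex(group|)regex', old)

print(repr(optional.group(1)))    # None: the group did not participate
print(repr(obligatory.group(1)))  # '': the empty alternative matched

# With the obligatory group, \1 is always defined.
new = re.sub(r'regex(group|)regex', r'something\1something', old)
print(new)
```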

    Solution 2: Using lambda expression in the replacement and checking if the group is not None

    This approach is necessary if you have optional groups inside another optional group.

    You can use a lambda in the replacement part to check if the group is initialized, not None, with lambda m: m.group(n) or ''. Use this solution in your case, because you have two backreferences - #3 and #4 - in the replacement pattern, but some matches (see Match 1 and 3) do not have Capture group 3 initialized. It happens because the whole first part - (\s*\{{2}funcA(ka|)\s*\|\s*([^}]*)\s*\}{2}\s*|) - is not participating in the match, and the inner Capture group 3 (i.e. ([^}]*)) just does not get populated even after adding an empty alternative.

    re.sub(r'(?i)(\s*\{{2}funcA(ka|)\s*\|\s*([^\}]*)\s*\}{2}\s*|)\{{2}funcB\s*\|\s*([^\}]*)\s*\}{2}\s*', 
    r"\n | funcA"+str(n)+r" = \3\n | funcB"+str(n)+r" = \4\n | string"+str(n)+r" = \n", 
    text, 
    count=1)
    

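    should be rewritten as follows. Note that the value returned by a replacement function is inserted literally, so the newlines must be real "\n" characters rather than raw r"\n" escapes:

    re.sub(r'(?i)(\s*{{funcA(ka|)\s*\|\s*([^}]*)\s*}}\s*|){{funcB\s*\|\s*([^}]*)\s*}}\s*', 
    lambda m: "\n | funcA"+str(n)+" = " + (m.group(3) or '') + "\n | funcB" + str(n) + " = " + (m.group(4) or '') + "\n | string" + str(n) + " = \n", 
    text, 
    count=1)  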
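
    The nested-group situation is easier to see on a reduced pattern. The sketch below uses a made-up pattern and the hypothetical names pattern and render purely for illustration; it is not the pattern from the question.

```python
import re

# The outer group has an empty alternative, so it always participates.
# The inner (\d+) only participates when the first alternative is taken.
pattern = re.compile(r'(A(\d+)-|)B(\d+)')

print(pattern.search('B7').groups())     # ('', None, '7')
print(pattern.search('A1-B2').groups())  # ('A1-', '1', '2')

def render(match):
    # Fall back to '' for the inner group that may not have participated.
    return 'A=' + (match.group(2) or '') + ' B=' + match.group(3)

print(pattern.sub(render, 'B7'))     # A= B=7
print(pattern.sub(render, 'A1-B2'))  # A=1 B=2
```

    The empty alternative rescues only the outer group; the inner one still needs the or '' check in the replacement function.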
    

    See IDEONE demo

    import re
     
    text = r'''
     
    {{funcB|param1}}
    *some string*
    {{funcA|param2}}
    {{funcB|param3}}
    *some string2*
     
    {{funcB|param4}}
    *some string3*
    {{funcAka|param5}}
    {{funcB|param6}}
    *some string4*
    '''
     
    for n in range(1, text.count('funcB') + 1):
        text = re.sub(r'(?i)(\s*\{{2}funcA(ka|)\s*\|\s*([^\}]*)\s*\}{2}\s*|)\{{2}funcB\s*\|\s*([^\}]*)\s*\}{2}\s*', 
        # The function's return value is inserted literally, so use real "\n" newlines, not raw r"\n".
        lambda m: "\n | funcA"+str(n)+" = "+(m.group(3) or '')+"\n | funcB"+str(n)+" = "+(m.group(4) or '')+"\n | string"+str(n)+" = \n", 
        text, 
        count=1) 
        
    expected = r'''
    | funcA1 =
    | funcB1 = param1
    | string1 =
    *some string*
    | funcA2 = param2
    | funcB2 = param3
    | string2 =
    *some string2*
    | funcA3 =
    | funcB3 = param4
    | string3 =
    *some string3*
    | funcA4 = param5
    | funcB4 = param6
    | string4 =
    *some string4*
    '''

    def lines(s):
        # Compare line by line, ignoring blank lines and surrounding whitespace.
        return [line.strip() for line in s.splitlines() if line.strip()]

    assert lines(text) == lines(expected)
    print('ok')
