Recursive PHP Regex

醉梦人生 2020-12-05 07:00

EDIT: I selected ridgerunner's answer as it contained the information needed to solve the problem. But I also felt like adding a fully fleshed-out solution to the s

4 answers
  •  死守一世寂寞
    2020-12-05 07:50

    IMPORTANT: This describes recursive regex in PHP (which uses the PCRE library). Recursive regex works a bit differently in Perl itself.

    Note: This is explained in the order in which it is easiest to conceptualize. The regex engine works in the opposite order: it dives down to the base case and works its way back up.

    Since your outer a's are explicitly there, the pattern will match an a between two a's, or a previous recursion's match of the entire pattern between two a's. As a result, it will only match odd numbers of a's (the middle one plus multiples of two).

    At a length of three, aaa is the current recursion's match, so on the fourth recursion it is looking for an a between two a's (i.e., aaa) or the previous recursion's match between two a's (i.e., a+aaa+a). It can't match five a's when the string is only four characters long, so the longest match it can make is three.

    The same happens at a length of six: it can only match the "default" aaa or the previous recursion's match surrounded by a's (i.e., a+aaaaa+a), and seven a's do not fit into six.


    However, it does not match all odd lengths.

    Since you're matching recursively, you can only match the literal aaa or a+(previous recursion's match)+a. Each successive match will therefore always be two a's longer than the previous match, or it will punt and fall back to aaa.

    At a length of seven (matching against aaaaaaa), the previous recursion's match was the fallback aaa. So this time, even though there are seven a's, it can only match three (aaa) or five (a+aaa+a).
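
The rule above can be sketched as a small recurrence. This is a minimal model in Python, not a call into PCRE. It assumes that the original pattern has the form a(?:(?R)|a?)a (consistent with the aa match for a length of two in the list further down), that the input consists only of a's, and that the recursion is atomic, meaning the outer level cannot ask the inner (?R) for a shorter match. The name match_len is made up for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def match_len(n: int) -> int:
    """Modelled match length at the start of a run of n 'a's; 0 means no match.

    Assumption: pattern a(?:(?R)|a?)a with atomic recursion, i.e. whatever
    the inner (?R) matched first is final for the outer level.
    """
    if n < 2:
        return 0                      # the two outer a's alone need 2 characters
    inner = match_len(n - 1)          # (?R) runs on what follows the first a
    if inner and n - 1 - inner >= 1:  # is an a left over to close the match?
        return inner + 2              # a + (previous match) + a
    # The (?R) branch failed as a whole, so fall back to the a? branch.
    return 3 if n >= 3 else 2


print([match_len(n) for n in range(1, 11)])
# [0, 2, 3, 3, 5, 3, 5, 7, 9, 3]
```

An engine or library version that can backtrack into a recursion would not follow this model, so treat the numbers as an illustration of the explanation rather than as a guarantee for any given PHP build.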


    When looping to longer lengths (80 in this example), look at the pattern (showing only the match, not the input):

    no match
    aa
    aaa
    aaa
    aaaaa
    aaa
    aaaaa
    aaaaaaa
    aaaaaaaaa
    aaa
    aaaaa
    aaaaaaa
    aaaaaaaaa
    aaaaaaaaaaa
    aaaaaaaaaaaaa
    aaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaa
    aaa
    aaaaa
    aaaaaaa
    aaaaaaaaa
    aaaaaaaaaaa
    aaaaaaaaaaaaa
    aaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaa
    aaaaa
    aaaaaaa
    aaaaaaaaa
    aaaaaaaaaaa
    aaaaaaaaaaaaa
    aaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaa
    aaaaa
    aaaaaaa
    aaaaaaaaa
    aaaaaaaaaaa
    aaaaaaaaaaaaa
    aaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaa
    

    What's going on here? Well, I'll tell you! :-)

    When a recursive match would be one character longer than the input string, it punts back to aaa, as we've seen. In every iteration after that, the match starts over and grows by two characters relative to the previous match. In each iteration the length of the input increases by one, but the length of the match increases by two. When the match size finally catches up with and would surpass the length of the input string, it punts back to aaa. And so on.

    Alternatively viewed, here we can see how many characters longer the input is compared to the match length in each iteration:

    (input len.)  -  (match len.)  =  (difference)
    
     1   -    0   =    1
     2   -    2   =    0
     3   -    3   =    0
     4   -    3   =    1
     5   -    5   =    0
     6   -    3   =    3
     7   -    5   =    2
     8   -    7   =    1
     9   -    9   =    0
    10   -    3   =    7
    11   -    5   =    6
    12   -    7   =    5
    13   -    9   =    4
    14   -   11   =    3
    15   -   13   =    2
    16   -   15   =    1
    17   -   17   =    0
    18   -    3   =   15
    19   -    5   =   14
    20   -    7   =   13
    21   -    9   =   12
    22   -   11   =   11
    23   -   13   =   10
    24   -   15   =    9
    25   -   17   =    8
    26   -   19   =    7
    27   -   21   =    6
    28   -   23   =    5
    29   -   25   =    4
    30   -   27   =    3
    31   -   29   =    2
    32   -   31   =    1
    33   -   33   =    0
    34   -    3   =   31
    35   -    5   =   30
    36   -    7   =   29
    37   -    9   =   28
    38   -   11   =   27
    39   -   13   =   26
    40   -   15   =   25
    41   -   17   =   24
    42   -   19   =   23
    43   -   21   =   22
    44   -   23   =   21
    45   -   25   =   20
    46   -   27   =   19
    47   -   29   =   18
    48   -   31   =   17
    49   -   33   =   16
    50   -   35   =   15
    51   -   37   =   14
    52   -   39   =   13
    53   -   41   =   12
    54   -   43   =   11
    55   -   45   =   10
    56   -   47   =    9
    57   -   49   =    8
    58   -   51   =    7
    59   -   53   =    6
    60   -   55   =    5
    61   -   57   =    4
    62   -   59   =    3
    63   -   61   =    2
    64   -   63   =    1
    65   -   65   =    0
    66   -    3   =   63
    67   -    5   =   62
    68   -    7   =   61
    69   -    9   =   60
    70   -   11   =   59
    71   -   13   =   58
    72   -   15   =   57
    73   -   17   =   56
    74   -   19   =   55
    75   -   21   =   54
    76   -   23   =   53
    77   -   25   =   52
    78   -   27   =   51
    79   -   29   =   50
    80   -   31   =   49
    

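    For reasons that should now make sense, this is tied to powers of 2: the difference reaches 0 at input lengths of 3, 5, 9, 17, 33 and 65 (one more than a power of 2), and the match falls back to aaa on the very next length.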
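
As a rough check of where the cycle restarts, the table can be regenerated bottom-up. This is a sketch under the same assumptions as the explanation: an input made only of a's, a pattern of the assumed form a(?:(?R)|a?)a, and atomic recursion. It does not run PCRE, and match_lengths is a hypothetical helper name.

```python
def match_lengths(limit: int) -> list:
    """lengths[n] is the modelled match length for n 'a's (0 = no match)."""
    lengths = [0, 0]                      # inputs of length 0 and 1 never match
    for n in range(2, limit + 1):
        inner = lengths[n - 1]            # what (?R) takes after the first a
        if inner and n - 1 - inner >= 1:  # room left for the closing a
            lengths.append(inner + 2)
        else:
            lengths.append(3 if n >= 3 else 2)  # fallback branch
    return lengths


lengths = match_lengths(80)

# Input lengths where the whole input is matched (difference of 0).
full = [n for n in range(1, 81) if lengths[n] == n]

# Input lengths where the match drops back to aaa right after a full match.
resets = [n for n in range(4, 81)
          if lengths[n] == 3 and lengths[n - 1] == n - 1]

print(full)    # [2, 3, 5, 9, 17, 33, 65]
print(resets)  # [4, 6, 10, 18, 34, 66]
```

Every full match sits at a length of 2^k + 1, and every reset follows at 2^k + 2.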


    Stepping through by hand

    I've slightly simplified the original pattern for this example. Remember this. We will come back to it.

    a((?R)|a)a
    

    What the author Jeffrey Friedl means by "the (?R) construct makes a recursive reference to the entire regular expression" is that the regex engine will substitute the entire pattern in place of (?R) as many times as possible.

    a((?R)|a)a                    # this
    
    a((a((?R)|a)a)|a)a            # becomes this
    
    a((a((a((?R)|a)a)|a)a)|a)a    # becomes this
    
    # and so on...
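
The substitution shown above can be reproduced with plain string replacement. This is only a textual illustration in Python of the mental model. The engine does not literally rewrite the pattern, and expand is a name invented for this sketch.

```python
def expand(pattern: str, depth: int) -> str:
    """Textually replace (?R) with the parenthesised pattern, depth times."""
    expanded = pattern
    for _ in range(depth):
        expanded = expanded.replace("(?R)", "(" + pattern + ")")
    return expanded


for depth in range(3):
    print(expand("a((?R)|a)a", depth))
# a((?R)|a)a
# a((a((?R)|a)a)|a)a
# a((a((a((?R)|a)a)|a)a)|a)a
```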
    

    When tracing this by hand, you could work from the inside out. In (?R)|a, a is your base case. So we'll start with that.

    a(a)a
    

    If that matches the input string, take that match (aaa) back to the original expression and put it in place of (?R).

    a(aaa|a)a
    

    If the input string is matched with our recursive value, substitute that match (aaaaa) back into the original expression to recurse again.

    a(aaaaa|a)a
    

    Repeat until you can't match your input using the result of the previous recursion.

    Example
    Input: aaaaaa
    Regex: a((?R)|a)a

    Start at base case, aaa.
    Does the input match with this value? Yes: aaa
    Recurse by putting aaa in the original expression:

    a(aaa|a)a
    

    Does the input match with our recursive value? Yes: aaaaa
    Recurse by putting aaaaa in the original expression:

    a(aaaaa|a)a
    

    Does the input match with our recursive value? No: aaaaaaa

    Then we stop here. The above expression could be rewritten (for simplicity) as:

    aaaaaaa|aaa
    

    Since it doesn't match aaaaaaa, it must match aaa. We're done, aaa is the final result.
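
The worked example can be cross-checked with a tiny hand-written matcher for the simplified pattern a((?R)|a)a. This is a sketch that assumes atomic recursion (the first result of the inner (?R) is final, with no backtracking into it) and it only tries a match at the start of the string. match_end is a hypothetical helper, not a PCRE or PHP function.

```python
from typing import Optional


def match_end(s: str, pos: int = 0) -> Optional[int]:
    """End index of a match of a((?R)|a)a starting at pos, or None.

    Assumption: atomic recursion, so the inner result is never revised.
    """
    if not s.startswith("a", pos):
        return None                        # the leading a is missing
    inner_end = match_end(s, pos + 1)      # try the (?R) branch first
    if inner_end is not None and s.startswith("a", inner_end):
        return inner_end + 1               # a + (recursive match) + a
    if s.startswith("aa", pos + 1):        # fallback branch: a + a + a
        return pos + 3
    return None


for text in ["aaaaa", "aaaaaa", "aaaaaaa"]:
    end = match_end(text)
    print(text, "->", text[:end] if end is not None else "no match")
# aaaaa -> aaaaa
# aaaaaa -> aaa
# aaaaaaa -> aaaaa
```

For the six-character input, the inner recursion already consumes everything up to the end of the string, so no closing a is left and the outer level falls back to aaa, which is the result reached by hand above.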
