Capturing Quantifiers and Quantifier Arithmetic


At the outset, let me explain that this question is neither about how to capture groups, nor about how to use quantifiers, two features of regex I am perfectly familiar with.

2 answers
  •  暖寄归人
    2020-11-30 05:12

    I don't know of any regex engine that can capture a quantifier. However, with PCRE or Perl you can use some tricks to check that you have the same number of characters. With your example:

    @@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"

    you can check whether @ = - / are balanced (that is, repeated the same number of times) with this pattern, which uses the famous Qtax trick (are you ready?): the "possessive-optional self-referencing group".

    ~(?

    pattern details:

    ~                          # pattern delimiter
    (?

    The main idea

    The non-capturing group contains only one @. Each time this group is repeated, one new character is added to each of capture groups 2, 3 and 4.

    The possessive-optional self-referencing group

    How does it work?

    ( (?: @ (?= [^=]* (\2?+ = ) .....) )+ )
    

    At the first occurrence of the @ character, capture group 2 is not yet defined, so you cannot write something like (\2 =): that would make the pattern fail. To avoid the problem, make the backreference optional: \2?

    The second aspect of this group is that the number of = characters matched grows by one at each repetition of the non-capturing group, since one = is added each time. To ensure that this number always increases (or the pattern fails), the possessive quantifier forces the backreference to be matched first, before a new = character is added.

    Note that this group can be read as: if group 2 exists, match it, then match the next =

    ( (?(2)\2) = )
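
    The counting done by the self-referencing group can be modelled outside the regex engine. The Python sketch below is only a model of the mechanism, not the pattern itself (as far as I know, Python's built-in re module refuses a backreference to a group that is still open). The function name grow_capture and its parameters are hypothetical.

```python
def grow_capture(at_count: int, eq_run: str) -> list:
    """Return the successive values of capture group 2, one per '@' consumed.

    Models (\\2?+ =): reuse the previous capture if there is one,
    then match one more '='. Raises ValueError when the run of '='
    is too short, which mirrors a failed match.
    """
    history = []
    previous = ""  # group 2 is not defined before the first '@'
    for _ in range(at_count):
        candidate = previous + "="  # previous capture, plus one new '='
        if not eq_run.startswith(candidate):
            raise ValueError("not enough '=' characters")
        previous = candidate  # the group is overwritten with the longer capture
        history.append(previous)
    return history


print(grow_capture(4, "===="))  # ['=', '==', '===', '====']
```

    After four repetitions the capture holds four = characters, which is why the captured text can then be compared against the real run of = in the string.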
    

    The recursive way

    ~(?<!@)(?=(@(?>[^@=]+|(?-1))*=)(?!=))(?=(@(?>[^@-]+|(?-1))*-)(?!-))(?=(@(?>[^@/]+|(?-1))*/)(?!/))~
    

    You need overlapping matches, since the @ part is used several times. That is why the whole pattern sits inside lookarounds.

    pattern details:

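    (?<!@)                # negative lookbehind: not preceded by an @
    (?=                   # open a lookahead
        (                 # open the capture group
            @             # a literal @
            (?>           # open an atomic group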
                [^@=]+    # all that is not an @ or an =, one or more times
              |           # OR
                (?-1)     # recursion: the last defined capturing group (the current here)
            )*            # repeat zero or more the atomic group
            =             #
        )                 # close the capture group
        (?!=)             # checks the = boundary
    )                     # close the lookahead
    (?=(@(?>[^@-]+|(?-1))*-)(?!-))  # the same for -
    (?=(@(?>[^@/]+|(?-1))*/)(?!/))  # the same for /
    

    The main difference from the previous pattern is that this one doesn't care about the order of the = - and / groups. (However, you can easily change the first pattern to handle that too, with character classes and negative lookaheads.)
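
    For comparison, when the engine at hand has no recursion, an order-independent check can also be done in host code by capturing each run and comparing lengths. This is a different technique from the pattern above, shown only as a sketch: the function name runs_match_lead and its parameters are hypothetical, and it only looks at the first run of each character after the @ run.

```python
import re


def runs_match_lead(text: str, lead: str = "@", others: str = "=-/") -> bool:
    """Check that the first run of each character in `others`, found anywhere
    after the first run of `lead`, has the same length as that `lead` run."""
    lead_run = re.search(re.escape(lead) + "+", text)
    if lead_run is None:
        return False
    expected = len(lead_run.group())
    rest = text[lead_run.end():]
    for ch in others:
        # Every search restarts from the same place, so order does not matter.
        run = re.search(re.escape(ch) + "+", rest)
        if run is None or len(run.group()) != expected:
            return False
    return True


line = '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'
print(runs_match_lead(line))  # True
```

    Restarting every search from the end of the @ run plays the same role as putting each check in its own lookahead.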

    Note: For the example string, to be more specific, you can replace the negative lookbehind with an anchor (^ or \A). And if you want the whole string as the match result, you must add .* at the end (otherwise the match result will be empty, as playful points out).
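
    The empty match result is a general property of lookaheads, which consume no characters. A small Python sketch shows it with a deliberately simplified pattern (it only tests that each character is present and does not count anything); the variable names are hypothetical.

```python
import re

line = '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'

# Three lookaheads anchored at the start: each one is tested from position 0
# and none of them consumes any character.
only_lookaheads = re.compile(r"^(?=[^=]*=)(?=[^-]*-)(?=[^/]*/)")

# Same checks, followed by .* so that the match covers the whole line.
with_tail = re.compile(r"^(?=[^=]*=)(?=[^-]*-)(?=[^/]*/).*")

empty_match = only_lookaheads.match(line).group()
full_match = with_tail.match(line).group()

print(repr(empty_match))   # ''
print(full_match == line)  # True
```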
