Converting PCRE recursive regex pattern to .NET balancing groups definition

北荒 2020-12-04 19:59

PCRE has a feature called recursive pattern, which can be used to match nested subgroups. For example, consider the "grammar"

Q -> \w | '[' A


        
2 Answers
  •  我在风中等你
    2020-12-04 20:52

    The .NET alternative to a recursive pattern is a stack. The challenge here is that we need to express the grammar in terms of stacks.
    Here's one way of doing that:

    A slightly different notation for grammars

    • First, we need grammar rules (like A and Q in the question).
    • We have one stack. The stack can only contain rules.
    • At each step we pop the current state from the stack, match what we need to match, and push further rules onto the stack. When we're done with a state we don't push anything, and so we get back to the previous state.

    This notation sits somewhere between a CFG and a pushdown automaton: the things we push onto the stack are rules.
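
    To make the pop/match/push loop concrete, here's a rough Python sketch of an interpreter for this kind of notation. The rule-table encoding and the names `Rules`, `matches` and `BRACKETS` are made up for illustration, and the nested-brackets rule set is just a toy, not the grammar from the question.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of the notation: each state maps to a list of
# alternatives; an alternative is (literal_to_match, states_to_push).
# States are pushed in the order listed, so the last one is popped first.
Rules = Dict[str, List[Tuple[str, List[str]]]]


def matches(rules: Rules, start: str, text: str) -> bool:
    """Return True if all of `text` is accepted, starting with `start` on the stack."""

    def step(pos: int, stack: Tuple[str, ...]) -> bool:
        if not stack:
            # Nothing left to do: succeed only if all input was consumed.
            return pos == len(text)
        state, rest = stack[-1], stack[:-1]  # pop the current state
        for literal, pushes in rules[state]:
            if text.startswith(literal, pos):
                # Match the literal, push further states, and carry on.
                if step(pos + len(literal), rest + tuple(pushes)):
                    return True
        return False  # every alternative failed, so backtrack

    return step(0, (start,))


# Toy rule set (an assumption for illustration): nested square brackets.
# ITEM matches "[" and promises a "]" (CLOSE) after an optional inner ITEM.
BRACKETS: Rules = {
    "ITEM": [("[", ["CLOSE", "ITEM"]), ("", [])],
    "CLOSE": [("]", [])],
}

print(matches(BRACKETS, "ITEM", "[[[]]]"))  # True
print(matches(BRACKETS, "ITEM", "[[]"))     # False
```

    The stack is passed along as an immutable tuple so that a failed alternative can simply return and let the caller try the next one.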

    Example:

    Let's start with a simple example: aⁿbⁿ. The usual grammar for this language is:

    S -> aSb | ε
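
    For comparison, here's roughly what matching this grammar by plain recursion looks like, which is the style a recursive pattern gives you. This is only a Python sketch; the function names `match_s` and `is_anbn` are invented for the example.

```python
def match_s(text: str, pos: int = 0) -> int:
    """Try to match S at `pos` and return the position after the match.

    S -> "a" S "b" | ε, so the ε alternative always succeeds and the
    function falls back to returning `pos` unchanged.
    """
    if text.startswith("a", pos):
        after_inner = match_s(text, pos + 1)  # recurse into the nested S
        if text.startswith("b", after_inner):
            return after_inner + 1            # matched "a" S "b"
    return pos                                # matched ε


def is_anbn(text: str) -> bool:
    # Accept only if S consumes the entire string.
    return match_s(text) == len(text)


print(is_anbn("aabb"))  # True
print(is_anbn("aab"))   # False
```

    Here the call stack does the bookkeeping: each pending call remembers that a "b" is still owed.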
    

    We can rephrase that to fit the notation:

    # Start with <push S>
    <pop S> -> "a" <push B> <push S> | ε
    <pop B> -> "b"
    

    In words:

    • We start with S in the stack.
    • When we pop S from the stack we can either:
      • Match nothing, or...
      • match "a", but then we have to push the state B to the stack. This is a promise we will match "b". Next we also push S so we can keep matching "a"s if we want to.
    • When we've matched enough "a"s we start popping Bs from the stack, and match a "b" for each one.
    • When this is done, we've matched an equal number of "a"s and "b"s.

    or, more loosely:

    When we're in case S, match "a" and push B and then S, or match nothing.
    When we're in case B, match "b".

    In all cases, we pop the current state from the stack.
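
    The same two cases can be simulated directly with an explicit stack. This is a minimal Python sketch, assuming we always prefer matching "a" when one is available (which is enough for this language, so no backtracking is needed); the function name `match_anbn` is made up.

```python
def match_anbn(text: str) -> bool:
    """Simulate the S/B stack machine on `text`."""
    stack = ["S"]  # start with S on the stack
    pos = 0
    while stack:
        state = stack.pop()  # in all cases, pop the current state
        if state == "S":
            if text.startswith("a", pos):
                pos += 1
                stack.append("B")  # promise to match a "b" later
                stack.append("S")  # stay ready for another "a"
            # otherwise match nothing (ε) and push nothing
        elif state == "B":
            if not text.startswith("b", pos):
                return False       # the promised "b" is missing
            pos += 1
    return pos == len(text)        # the whole input must be consumed


print(match_anbn("aaabbb"))  # True
print(match_anbn("aab"))     # False
```

    Each "a" leaves one B behind on the stack, so the number of "b"s we go on to match is forced to equal the number of "a"s.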

    Writing the pattern as a .NET regular expression

    We need to represent the different states somehow. We can't pick '1', '2', '3' or 'a', 'b', 'c', because those characters may not appear in our input string, and we can only match what is present in the string.
    One option is to number our states (in the example above, S would be state number 0, and B would be state number 1).
    For state S
