Converting PCRE recursive regex pattern to .NET balancing groups definition

一笑奈何 提交于 2019-12-28 05:14:10

问题


PCRE has a feature called recursive pattern, which can be used to match nested subgroups. For example, consider the "grammar"

Q -> \w | '[' A ';' Q* ','? Q* ']' | '<' A '>'
A -> (Q | ',')*
// to match ^A$.

It can be done in PCRE with the pattern

^((?:,|(\w|\[(?1);(?2)*,?(?2)*\]|<(?1)>))*)$

(Example test case: http://www.ideone.com/L4lHE)

Should match:

abcdefg abc,def,ghi abc,,,def ,,,,,, [abc;] [a,bc;] sss[abc;d] as[abc;d,e] [abc;d,e][fgh;j,k] <abc> [<a>b;<c,d>,<e,f>] <a,b,c> <a,bb,c> <,,,> <> <><> <>,<> a<<<<>>><a>> <<<<<>>>><><<<>>>> <z>[a;b] <z[a;b]> [[;];] [,;,] [;[;]] [<[;]>;<[;][;,<[;,]>]>]

Should not match:

<a bc> <abc<de> [a<b;c>;d,e] [a] <<<<<>>>><><<<>>>>> <<<<<>>>><><<<>>> [abc;def;] [[;],] [;,,] [abc;d,e,f] [<[;]>;<[;][;,<[;,]>]]> <z[a;b>]

There is no recursive pattern in .NET. Instead, it provides balancing groups for stack-based manipulation for matching simple nested patterns.

Is it possible to convert the above PCRE pattern into .NET Regex style?

(Yes I know it's better not to use regex in for this. It's just a theoretical question.)

References

  • pcre.org - PCRE man page - Recursive Patterns
  • MSDN - Regular Expression Language Elements - Balancing Group Definitions

回答1:


The .Net alternative to recursive pattern is a stack. The challenge here is that we need to express the grammar it terms of stacks.
Here's one way of doing that:

A slightly different notation for grammars

  • First, we need grammar rules (like A and Q in the question).
  • We have one stack. The stack can only contain rules.
  • At each step we pop the current state from the stack, match what we need to match, and push further rules into the stack. When we're done with a state we don't push anything and get back to the previous state.

This notation is somewhere between CFG and Pushdown automaton, where we push rules to the stack.

Example:

Let's start with a simple example: anbn. The usual grammar for this language is:

S -> aSb | ε

We can rephrase that to fit the notation:

# Start with <push S>
<pop S> -> "a" <push B> <push S> | ε
<pop B> -> "b"

In words:

  • We start with S in the stack.
  • When we pop S from the stack we can either:
    • Match nothing, or...
    • match "a", but then we have to push the state B to the stack. This is a promise we will match "b". Next we also push S so we can keep matching "a"s if we want to.
  • When we've matched enough "a"s we start popping Bs from the stack, and match a "b" for each one.
  • When this is done, we've matched an even number of "a"'s and "b"s.

or, more loosely:

When we're in case S, match "a" and push B and then S, or match nothing.
When we're in case B, match "b".

In all cases, we pop the current state from the stack.

Writing the pattern in a .Net regular expression

We need to represent the different states somehow. We can't pick '1' '2' '3' or 'a' 'b' 'c', because those may not be available in our input string - we can only match what is present in the string.
One option is to number our states (In the example above, S would state number 0, and B is state 1).
For state S𝒊 we can push 𝒊 characters to the stack. For convenience, we'll push the first 𝒊 characters from the start of the string. Again, we don't care what are these characters, just how many there are.

Push state

In .Net, if we want to push the first 5 characters of a string to a stack, we can write:

(?<=(?=(?<StateId>.{5}))\A.*)

This is a little convoluted:

  • (?<=…\A.*) is a lookbehind that goes all the way to the start of the string.
  • When we're at the start there's a look ahead: (?=…). This is done so we can match beyond the current position - if we are at position 2, we don't have 5 characters before us. So we're looking back and looking forward.
  • (?<StateId>.{5}) push 5 characters to the stack, indicating that at some point we need to match state 5.

Pop state

According to our notation, we always pop the top state from the stack. That is easy: (?<-StateId>).
But before we do that, we want to know what state that was - or how many characters it captured. More specifically, we need to check explicitly for each state, like a switch/case block. So, to check if the stack currently holds state 5:

(?<=(?=.{5}(?<=\A\k<StateId>))\A.*)
  • Again, (?<=…\A.*) goes all the way to the start.
  • Now we advance (?=.{5}…) by five characters...
  • And use another lookbehind, (?<=\A\k<StateId>) to check that the stack really has 5 characters.

This has an obvious drawback - when the string is too short, we cannot represent the number of large states. This problem has solutions:

  • The number of short words in the language is final anyway, so we can add them manually.
  • Use something more complicated than a single stack - we can use several stacks, each with zero or one characters, effectively bits of our state (there's an example at the end).

Result

Our pattern for anbn looks like this:

\A
# Push State A, Index = 0
(?<StateId>)
(?:
    (?:
        (?:
            # When In State A, Index = 0
            (?<=(?=.{0}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                # Push State B, Index = 1
                (?<=(?=(?<StateId>.{1}))\A.*)
                a
                # Push State A, Index = 0
                (?<StateId>)
                |

            )
        )
        |
        (?:
            # When In State B, Index = 1
            (?<=(?=.{1}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            b
        )
        |\Z
    ){2}
)+
\Z
# Assert state stack is empty
(?(StateId)(?!))

Working example on Regex Storm

Notes:

  • Note that the quantifier around the states is (?:(?:…){2})+ - that is, (state count)×(input length). I also added an alternation for \Z. Let's not get into that, but it's a workaround for an annoying optimization in the .Net engine.
  • The same can be written as (?<A>a)+(?<-A>b)+(?(A)(?!)) - this is just an exercise.

To answer the question

The grammar from the question can be rewritten as:

# Start with <push A>
<pop A> -> <push A> ( @"," | <push Q> ) | ε
<pop Q> -> \w
           | "<" <push Q2Close> <push A>
           | "[" <push Q1Close> <push QStar> <push Q1Comma> <push QStar> <push Q1Semicolon> <push A>
<pop Q2Close> -> ">"
<pop QStar> -> <push QStar> <push Q> | ε 
<pop Q1Comma> -> ","?
<pop Q1Semicolon> -> ";"
<pop Q1Close> -> "]"

The pattern:

\A
# Push State A, Index = 0
(?<StateId>)
(?:
    (?:
        (?:
            # When In State A, Index = 0
            (?<=(?=.{0}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                # Push State A, Index = 0
                (?<StateId>)
                (?:
                    ,
                    |
                    # Push State Q, Index = 1
                    (?<=(?=(?<StateId>.{1}))\A.*)
                )
                |

            )
        )
        |
        (?:
            # When In State Q, Index = 1
            (?<=(?=.{1}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                \w
                |
                <
                # Push State Q2Close, Index = 2
                (?<=(?=(?<StateId>.{2}))\A.*)
                # Push State A, Index = 0
                (?<StateId>)
                |
                \[
                # Push State Q1Close, Index = 6
                (?<=(?=(?<StateId>.{6}))\A.*)
                # Push State QStar, Index = 3
                (?<=(?=(?<StateId>.{3}))\A.*)
                # Push State Q1Comma, Index = 4
                (?<=(?=(?<StateId>.{4}))\A.*)
                # Push State QStar, Index = 3
                (?<=(?=(?<StateId>.{3}))\A.*)
                # Push State Q1Semicolon, Index = 5
                (?<=(?=(?<StateId>.{5}))\A.*)
                # Push State A, Index = 0
                (?<StateId>)
            )
        )
        |
        (?:
            # When In State Q2Close, Index = 2
            (?<=(?=.{2}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            >
        )
        |
        (?:
            # When In State QStar, Index = 3
            (?<=(?=.{3}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                # Push State QStar, Index = 3
                (?<=(?=(?<StateId>.{3}))\A.*)
                # Push State Q, Index = 1
                (?<=(?=(?<StateId>.{1}))\A.*)
                |

            )
        )
        |
        (?:
            # When In State Q1Comma, Index = 4
            (?<=(?=.{4}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            ,?
        )
        |
        (?:
            # When In State Q1Semicolon, Index = 5
            (?<=(?=.{5}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            ;
        )
        |
        (?:
            # When In State Q1Close, Index = 6
            (?<=(?=.{6}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            \]
        )
        |\Z
    ){7}
)+
\Z
# Assert state stack is empty
(?(StateId)(?!))

Sadly, it is too long to fit in a url, so no online example.

If we use "binary" stacks with one or zero characters, it would have looked like this: https://gist.github.com/kobi/8012361

Here's a screenshot of the pattern passing all tests: http://i.stack.imgur.com/IW2xr.png

Bonus

The .Net engine can do more than just to balance - it can also capture each instance of A or Q. This requires a slight modification of the pattern: https://gist.github.com/kobi/8156968 .
Note that we've added the groups Start, A and Q to the pattern, but they have no effect of the flow, they are used purely for capturing.

The result: for example, for the string "[<a>b;<c,d>,<e,f>]", we can get these Captures:

Group A
    (0-17) [<a>b;<c,d>,<e,f>]
    (1-4) <a>b
    (2-2) a
    (7-9) c,d
    (13-15) e,f
Group Q
    (0-17) [<a>b;<c,d>,<e,f>]
    (1-3) <a>
    (2-2) a
    (4-4) b
    (6-10) <c,d>
    (7-7) c
    (9-9) d
    (12-16) <e,f>
    (13-13) e
    (15-15) f

Open questions

  • Can every grammar be converted to the state-stack notation?
  • Is (state count)×(input length) enough steps to match all words? What other formula can work?

Source code

The code used to generate these patterns and all test cases can be found on https://github.com/kobi/RecreationalRegex




回答2:


The answer is (probably) Yes.

The technique is much more complex than the (?1) recursive call, but the result is almost 1-to-1 with the rules of the grammar - I worked in a such methodical way I can easily see it scripted. Basically, you match block-by-block, and use the stack to keep track of where you are. This is an almost working solution:

^(?:
    (\w(?<Q>)) # Q1
    |
    (<(?<Angle>))  #Q2 - start <
    |
    (\>(?<-Angle>)(?<-A>)?(?<Q>))  #Q2 - end >, match Q
    |
    (\[(?<Block>))  # Q3 start - [
    |
    (;(?<Semi-Block>)(?<-A>)?)  #Q3 - ; after [
    |
    (\](?<-Semi>)(?<-Q>)*(?<Q>))  #Q3 ] after ;, match Q
    |
    ((,|(?<-Q>))*(?<A>))   #Match an A group
)*$
# Post Conditions
(?(Angle)(?!))
(?(Block)(?!))
(?(Semi)(?!))

It is missing the part of allowing commas in Q->[A;Q*,?Q*], and for some reason allows [A;A], so it matches [;,,] and [abc;d,e,f]. Rest of the strings match the same as the test cases.
Another minor point is an issue with pushing to the stack with an empty capture - it doesn't. A accepts Ø, so I had to use (?<-A>)? to check if it captured.

The whole regex should look like this, but again, it is useless with the bug there.

Why it isn't working?

There is not way of synchronizing the stacks: if I push (?<A>) and (?<B>), I can pop them in any order. That is why this pattern cannot differentiate <z[a;b>] from <z[a;b]>... we need one stack for all.
This can be solved for simple cases, but here we have something much more complicate - A whole Q or A, not just "<" or "[".



来源:https://stackoverflow.com/questions/3349999/converting-pcre-recursive-regex-pattern-to-net-balancing-groups-definition

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!