PCRE has a feature called recursive pattern, which can be used to match nested subgroups. For example, consider the "grammar"
Q -> \w | '[' A
The answer is (probably) Yes.
The technique is much more complex than the (?1) recursive call, but the result is almost 1-to-1 with the rules of the grammar. I worked in such a methodical way that I can easily see it scripted. Basically, you match block by block and use the stack to keep track of where you are. This is an almost working solution:
^(?:
(\w(?<Q>)) # Q1
|
(<(?<Angle>)) #Q2 - start <
|
(\>(?<-Angle>)(?<-A>)?(?<Q>)) #Q2 - end >, match Q
|
(\[(?<Block>)) # Q3 start - [
|
(;(?<Semi-Block>)(?<-A>)?) #Q3 - ; after [
|
(\](?<-Semi>)(?<-Q>)*(?<Q>)) #Q3 ] after ;, match Q
|
((,|(?<-Q>))*(?<A>)) #Match an A group
)*$
# Post Conditions
(?(Angle)(?!))
(?(Block)(?!))
(?(Semi)(?!))
It is missing the part that allows commas in Q->[A;Q*,?Q*], and for some reason it allows [A;A], so it matches [;,,] and [abc;d,e,f]. The rest of the strings match the same as the test cases.
Another minor point is an issue with pushing to the stack with an empty capture: it doesn't push. A accepts Ø, so I had to use (?<-A>)? to check if it captured.
The whole regex should look like this, but again, it is useless with the bug there.
There is no way of synchronizing the stacks: if I push (?<A>) and (?<B>), I can pop them in any order. That is why this pattern cannot differentiate <z[a;b>] from <z[a;b]>. We need one stack for all.
This can be solved for simple cases, but here we have something much more complicated: a whole Q or A, not just "<" or "[".
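To make the synchronization problem concrete outside of regex, here is a minimal Python sketch. It is not .NET code, and the function names are made up for illustration. The first checker keeps one independent counter per bracket type, which is roughly what separate named stacks give you; the second keeps a single shared stack.

```python
# Maps each closing bracket to the opener it must pair with.
PAIRS = {">": "<", "]": "["}


def balanced_with_counters(text: str) -> bool:
    """One independent counter per bracket type (like separate stacks)."""
    depth = {"<": 0, "[": 0}
    for ch in text:
        if ch in depth:
            depth[ch] += 1
        elif ch in PAIRS:
            opener = PAIRS[ch]
            if depth[opener] == 0:
                return False  # closing bracket with nothing open
            depth[opener] -= 1
    return all(d == 0 for d in depth.values())


def balanced_with_one_stack(text: str) -> bool:
    """A single shared stack, so the order of openers is remembered."""
    stack = []
    for ch in text:
        if ch in "<[":
            stack.append(ch)
        elif ch in PAIRS:
            # The most recent opener must be the one this closer pairs with.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


for sample in ("<z[a;b]>", "<z[a;b>]"):
    print(sample, balanced_with_counters(sample), balanced_with_one_stack(sample))
```

The counter version accepts both strings because it only knows how many of each bracket are open, not the order in which they were opened. The single stack rejects <z[a;b>] as soon as it sees ">" while "[" is on top.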
The .NET alternative to recursive patterns is a stack. The challenge here is that we need to express the grammar in terms of stacks.
Here's one way of doing that:
A and Q in the question). This notation is somewhere between a CFG and a pushdown automaton, where we push rules to the stack.
Let's start with a simple example: aⁿbⁿ. The usual grammar for this language is:
S -> aSb | ε
We can rephrase that to fit the notation:
# Start with <push S>
<pop S> -> "a" <push B> <push S> | ε
<pop B> -> "b"
In words:
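- We start with S in the stack.
- When we pop S from the stack we can either:
  - match nothing, or
  - match "a" and push B to the stack. This is a promise we will match "b". Next we also push S so we can keep matching "a"s if we want to.
- When we are done matching "a"s, we pop the Bs from the stack, and match a "b" for each one.

or, more loosely: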
When we're in case S, match "a" and push B and then S, or match nothing.
When we're in case B, match "b".
In all cases, we pop the current state from the stack.
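The rules above can be simulated directly with an explicit stack. The following Python sketch is only an illustration of the notation, not the regex itself; the function name is hypothetical, and it assumes we always prefer matching "a" over the empty alternative in state S, which is safe for this particular language.

```python
def matches_anbn(text: str) -> bool:
    """Simulate the push/pop rules for a^n b^n with an explicit stack."""
    stack = ["S"]  # start with <push S>
    pos = 0        # index of the next unread character
    while stack:
        state = stack.pop()  # in all cases, pop the current state
        if state == "S":
            if pos < len(text) and text[pos] == "a":
                pos += 1
                stack.append("B")  # a promise to match "b" later
                stack.append("S")  # so we can keep matching "a"s
            # otherwise: the empty alternative, match nothing
        elif state == "B":
            if pos < len(text) and text[pos] == "b":
                pos += 1
            else:
                return False  # promised "b" is missing
    # The stack is empty; accept only if all input was consumed.
    return pos == len(text)


for sample in ("", "ab", "aabb", "aab", "abb", "ba"):
    print(repr(sample), matches_anbn(sample))
```

Each "a" leaves exactly one B behind on the stack, so the number of "b"s that must follow is tracked without any counter.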
We need to represent the different states somehow. We can't pick '1' '2' '3' or 'a' 'b' 'c', because those may not be available in our input string - we can only match what is present in the string.
One option is to number our states (in the example above, S would be state number 0, and B would be state 1).
For state S