问题
TL;DR: Is there a way to specify a conditional so that an opening element MUST match its paired closing element?
Example is located on regex101.com.
=====
Balancing elements in regex is typically handled through recursion. This means that nested {...{...{...}...}...} can be located.
Also, PCRE allows the (?(DEFINE)...) construct, which lets you define various patterns without actually starting the match.
In the regular expression
# Define the opening and closing elements before the recursion occurs
(?(DEFINE)
(?<open_curly>\{)
(?<close_curly>\})
# ... other definitions here ...
(?<open>\g'open_curly')
(?<close>\g'close_curly')
)
# Match the opening element
(\g'open'
(?>
# For recursion, don't match either the opening or closing element
(?!\g'open'|\g'close')(?s:.)
|
# Recurse this captured pattern
(?-1)
)*
# Match the closing element
\g'close')
the elements are the { and } characters, and can match against patterns such as
{{{}}}
{ test1 { test2 { test3 { test4 } } } }
I want to include other open/close elements, such as [ and ], or --[ and --], so include those in the (?(DEFINE)):
(?<open_square>\[)
(?<close_square>\])
(?P<open_pascal>(?i:\bbegin\b))
(?P<close_pascal>(?i:\bend\b))
(?P<open_lua>--\[)
(?P<close_lua>--\])
(?<open>\g'open_curly'|\g'open_square'|\g'open_pascal'|\g'open_lua')
(?<close>\g'close_curly'|\g'close_square'|\g'close_pascal'|\g'close_lua')
What this DOESN'T do correctly is to pair the opening element with its closing element, allowing --[ to group with }, which is not desirable.
Is there a way to create open/close pairs in a regex like this?
回答1:
I would say that there is no use polluting the logic with a bunch of named groups and
unnecessary recursion that is unpredictable.
There are three main parts to maintaining a proper recursion (listed below).
To do it right, you must parse everything, so you have to take into account
content and unbalanced errors.
The engine won't let you capture in detail, anything beyond the first level.
That means its easier to maintain a TOP that you can pull info from and a CORE
that you can't maintain, but does the recursion. A few subtle differences but can be seen
in the below examples.
Anytime a core is recursed, it is immediatly surrounded by an individual
set (pair) of unique delimiters. This is to unwind the stack properly.
This process can't be factored out (generalized).
update
Usually this regex is called within a recursive function, passing the CORE to it each time.
Example pseudo-code:
bool bIsOk = true;
bool RecurseCore( string core )
{
while( regex_search ( core, regex, match ) )
{
if ( match[1].success ) { print 'content' }
else
if ( match[2].success ) { print 'square'; bIsOk = RecurseCore( match[2].value ) }
else
if ( match[3].success ) { print 'curly'; bIsOk = RecurseCore( match[3].value ) }
else
if ( match[4].success ) { print 'pascal'; bIsOk = RecurseCore( match[4].value ) }
else
if ( match[5].success ) { print 'lua'; bIsOk = RecurseCore( match[5].value ) }
else
if ( match[6].success ) { print 'error'; bIsOk = false } // error
if ( bIsOk == false ) { break }
}
return bIsOk;
}
Regex:
# //////////////////////////////////////////////////////
# // The General Guide to 3-Part Recursive Parsing
# // ----------------------------------------------
# // Part 1. CONTENT
# // Part 2. CORE
# // Part 3. ERRORS
(?si) # Dot all, no case
(?:
( # (1), Take off CONTENT
(?&content)
)
| # OR
\[ # Square's delimiter
( # (2), square CORE
(?= . )
(?&core)
|
)
\] # End-Delimiter
| # OR
\{ # Curly's delimiter
( # (3), curly CORE
(?= . )
(?&core)
|
)
\} # End-Delimiter
| # OR
\b begin \b # Pascal's delimiter
( # (4), pascal CORE
(?= . )
(?&core)
|
)
\b end \b # End-Delimiter
| # OR
--\[ # Lua's delimiter
( # (5), lua CORE
(?= . )
(?&core)
|
)
--\] # End-Delimiter
| # OR
( # (6), Take off Unbalanced (delimeter) ERRORS
\b
(?: begin | end )
\b
| -- [\[\]]
| [\[\]{}]
)
)
# ///////////////////////
# // Subroutines
# // ---------------
(?(DEFINE)
# core
(?<core>
(?>
(?&content)
|
\[
(?: # Square delimiter
(?= . ) # recurse core
(?&core)
|
)
\]
| # OR
\{
(?: # Curly delimiter
(?= . ) # recurse core
(?&core)
|
)
\}
| # OR
\b begin \b
(?: # Pascal delimiter
(?= . ) # recurse core
(?&core)
|
)
\b end \b
| # OR
--\[
(?: # Lua delimiter
(?= . ) # recurse core
(?&core)
|
)
--\]
)+
)
# content
(?<content>
(?>
(?!
\b
(?: begin | end )
\b
| -- [\[\]]
| [\[\]{}]
)
.
)+
)
)
回答2:
You got it wrong. You can achieve it with recursion like this:
(?(DEFINE)
(?<curly> \{ \g<content>*? \} )
(?<square> \[ \g<content>*? \] )
(?<pascal> \bbegin\b \g<content>*? \bend\b )
(?<lua> --\[ \g<content>*? --\] )
(?<nested> \g<curly> | \g<square> | \g<pascal> | \g<lua> )
(?<content>
# Math non-recursive content (atomically)
(?: (?! [{}\[\]] | \bbegin\b | \bend\b | --[\[\]] ) . )++
# Or recurse
| \g<nested>
)
)
\g<nested>
Demo
The trick here is to leverage the regex engine stack to keep a sort of memory of the matched opening symbol, in order to require the matching closing symbol when the regex engine unwinds the stack as it leaves the recursion. You also need to make sure the catch-all . doesn't consume a group starting/ending symbol.
Simpler cases are more readable:
(?(DEFINE)
(?<curly> \{ \g<content>*? \} )
(?<square> \[ \g<content>*? \] )
(?<nested> \g<curly> | \g<square> )
(?<content> [^{}\[\]]++ | \g<nested> )
)
\g<nested>
Demo
Notice the atomic group. It prevents excessive backtracking. Also, this regex instantly consumes anything that can't be recursive without trying to recurse into it first.
Also, to answer your question, conditionals are of no use here, since you need a stack to keep track of the ending symbol you need to match.
Some implementations (PCRE) let you use a stack by means of recursive patterns, while others (.NET) expose a stack through balancing groups. But when you have several kinds of balanced constructs, recursion is the clear winner.
来源:https://stackoverflow.com/questions/27827819/can-conditionals-be-used-to-pair-balance-group-elements