Regex Recursion: Nth Subpatterns

邮差的信 提交于 2021-01-27 13:44:07

问题


I'm trying to learn about Recursion in Regular Expressions, and have a basic understanding of the concepts in the PCRE flavour. I want to break a string:

Geese (Flock) Dogs (Pack) 

into:

Full Match: Geese (Flock) Dogs (Pack) 
Group 1: Geese (Flock)
Group 2: Geese
Group 3: (Flock)
Group 4: Dogs (Pack)
Group 5: Dogs
Group 6: (Pack)

I know neither regex quite does this, but I was more curious as to the reason why the the first pattern works, but the second one doesn't.

Pattern 1: ((.*?)(\(\w{1,}\)))((.*?)(\g<3>))*
Pattern 2: ((.*?)(\(\w{1,}\)))((\g<2>)(\g<3>))*

Also, if for example you're dealing with a long string, and a pattern repeats itself, is it possible to continually expand the full match, and incrementally increase the groups without writing a loop statement separate to the regex.

Full Match: Geese (Flock) Dogs (Pack) Elephants (Herd) 
Group 1: Geese (Flock)
Group 2: Geese
Group 3: (Flock)
Group 4: Dogs (Pack)
Group 5: Dogs
Group 6: (Pack)
Group 7: Elephants (Herd)
Group 8: Elephants 
Group 9: (Herd)

This is the closest I've came to was this pattern, but the middle group: Dogs (Pack) becomes Group 0.

((.*?)(\(\w{1,}\)))((.*?)(\g<3>))*

回答1:


Mind that recursion levels in PCRE are atomic. Once these patterns find a match they are never re-tried.

See Recursion and Subroutine Calls May or May Not Be Atomic:

Perl and Ruby backtrack into recursion if the remainder of the regex after the recursion fails. They try all permutations of the recursion as needed to allow the remainder of the regex to match. PCRE treats recursion as atomic. PCRE backtracks normally during the recursion, but once the recursion has matched, it does not try any further permutations of the recursion, even when the remainder of the regex fails to match. The result is that Perl and Ruby may find regex matches that PCRE cannot find, or that Perl and Ruby may find different regex matches.

Your second pattern, at the first recursion level, will look like

((.*?)(\(\w{1,}\)))(((?>.*?))((?>\(\w{1,}\))))*
                     ^^^^^^^  ^^^^^^^^^^^^^^

See demo. That is, \g<2> is then (?>.*?), not .*?. That means that, after the ((.*?)(\(\w{1,}\))) pattern matched Geese (Flock), the regex engine tries to match with (?>.*?), sees it is a lazy pattern that does not have to consume any chars, skips it (and will never come back to this pattern), and tries to match with (?>\(\w{1,}\)). As there is no ( after ), the regex returns what it consumed.

As for the second question, it is a common problem. It is not possible to get an arbitrary number of captures with a PCRE regex, as in case of repeated captures only the last captured value is stored in the group buffer. You cannot have more submatches in the resulting array than the number of capturing groups inside the regex pattern. See Repeating a Capturing Group vs. Capturing a Repeated Group for more details.



来源:https://stackoverflow.com/questions/51900704/regex-recursion-nth-subpatterns

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!