Capturing and (thisPartOnly) with the same group

问题

Let's say we have the following input:

<amy>
(bob)
<carol)
(dean>

We also have the following regex:

<(\w+)>|\((\w+)\)

Now we get two matches (as seen on rubular.com):

<amy> is a match, \1 captures amy, \2 fails
(bob) is a match, \2 captures bob, \1 fails

This regex does most of what we want, which are:

It matches the open and close brackets properly (i.e. no mixing)
It captures the part we're interested in

However, it does have a few drawbacks:

The capturing pattern (i.e. the "main" part) is repeated
- It's only \w+ in this case, but generally speaking this can be quite complex,
  - If it involves backreferences, then they must be renumbered for each alternate!
  - Repetition makes maintenance a nightmare! (what if it changes?)
The groups are essentially duplicated
- Depending on which alternate matches, we must query different groups
  - It's only \1 or \2 in this case, but generally the "main" part can have capturing groups of their own!
- Not only is this inconvenient, but there may be situations where this is not feasible (e.g. when we're using a custom regex framework that is limited to querying only one group)
The situation quickly worsens if we also want to match {...}, [...], etc.

So the question is obvious: how can we do this without repeating the "main" pattern?

Note: for the most part I'm interested in java.util.regex flavor, but other flavors are welcomed.

Appendix

There's nothing new in this section; it only illustrates the problem mentioned above with an example.

Let's take the above example to the next step: we now want to match these:

<amy=amy>
(bob=bob)
[carol=carol]

But not these:

<amy=amy)   # non-matching bracket
<amy=bob>   # left hand side not equal to right hand side

Using the alternate technique, we have the following that works (as seen on rubular.com):

<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]

As explained above:

The main pattern can't simply be repeated; backreferences must be renumbered
Repetition also means maintenance nightmare if it ever changes
Depending on which alternate matches, we must query either \1 \2, \3 \4, or \5 \6

回答1:

You can use a lookahead to "lock in" the group number before doing the real match.

String s = "<amy=amy>(bob=bob)[carol=carol]";
Pattern p = Pattern.compile(
  "(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");
Matcher m = p.matcher(s);

while(m.find())
{
  System.out.printf("found %s in %s%n", m.group(2), m.group());
}

output:

found amy in <amy=amy>
found bob in (bob=bob)
found carol in [carol=carol]

It's still ugly as hell, but you don't have to recalculate all the group numbers every time you make a change. For example, to add support for curly brackets, it's just:

"(?=[<(\\[{]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\]|\\{\\1\\})"

回答2:

In preg (Perl Regex library), this will match your example, and \3 will catch the insides:

((<)|\()(\w+)(?(2)>|\))

It will not work in JS, though - you did not specify the dialect...

It depends on the conditional operator (?(2)...|...) which basically says if 2 is a non-null capture, then match before the pipe, else match after the pipe. In this form, pipe is not alternation ("or").

UPDATE Sorry, I completely missed the Java bit :) Anyway, apparently Java does not support the conditional construct; and I have no idea how else I'd go about it :(

Also, for your Appendix (even though it's the wrong dialect):

(?:(<)|(\()|\[)(\w+)=\3(?(1)>|(?(2)\)|]))

The name is in again in \3 (I got rid of the first capturing paren, but I had to add another one for one extra opening paren check)

回答3:

The only solution that I was able to come up with is inspired by technique of capturing an empty string on different alternates; backreferencing to these groups later can serve as pseudo-conditionals.

Thus, this pattern works for the second example (as seen on rubular.com):

                  __main__
                 /        \
(?:<()|\(()|\[())((\w+)=\5)(\1>|\2\)|\3\])
\_______________/          \_____________/
    \1   \2   \3

So essentially for each opening bracket, we assign a group that captures an empty string. Then when we try to match the closing bracket, we see which group was succesful, and match the corresponding closing bracket.

The "main" part does not have to be repeated, but in Java, backreferences may have to be renumbered. This won't be a problem in flavors that support named groups.

回答4:

May be this example in Perl will interest you :

$str = q/<amy=amy> (bob=bob) [carol=carol] <amy=amy) <amy=bob>/;
$re = qr/(?:<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\])+/;
@list = ($str =~ /$re/g);
for(@list) {
    say $i++," = ",$_;
}

I just surround your regex by (?:regex)+

回答5:

When you get things like this, using a single regex is a silly restriction, and I simply don't agree with your "maintenance nightmare" to using more than one - repeating a similar-but-different expression several times is likely to be more maintainable (well, less unmaintainable), and maybe even better performance too, than a single overly-complex regex.

But anyway, there's no repetition if you just use variables to compose your regex.

Here's some pseudo-code:

Brackets = "<>,(),[]"
CoreRegex = "(\w+)=\1"

loop CurBracket in Brackets.split(',')
{
    Input.match( Regex.quote(CurBracket.left(1)) & CoreRegex & Regex.quote(CurBracket.right(1)) )
}

(p.s.that's just to give the general idea - I'd probably use already-escaped arrays for the bracket sets in actual implementation).

回答6:

Assuming there is no easy way to manually write this regular expression, why not leave it to the computer? You could have a function, maybe like below (I am using C# syntax here, as I am a bit more familiar with regexes here than in Java, but it should not be too difficult to adapt it to Java).

Note that I left the function AdaptBackreferences() more or less unimplemented as an exercise to the reader. It should just adapt the backreference numbering.

    struct BracketPair {public string Open; public string Close;};

    static string[] MatchTextInBrackets(string text, string innerPattern, BracketPair[] bracketPairs) {
        StringBuilder sb  = new StringBuilder();

        // count number of catching parentheses of innerPattern here:
        int numberOfInnerCapturingParentheses = Regex.Match("", innerPattern).Groups.Count - 1;

        bool firstTime = true;
        foreach (BracketPair pair in bracketPairs) {
            // apply logic to change backreference numbering:
            string adaptedInnerPattern = AdaptBackreferences(innerPattern);
            if (firstTime) { firstTime = false; } else { sb.Append('|'); }
            sb.Append(pair.Open).Append("(").Append(adaptedInnerPattern).Append(")").Append(pair.Close);
        }
        string myPattern = sb.ToString();
        MatchCollection matches = Regex.Matches(text, myPattern);
        string[] result = new string[matches.Count];
        for(int i=0; i < matches.Count; i++) {
            StringBuilder mb = new StringBuilder();
            for(int j=0; j < bracketPairs.Length; j++) {
                mb.Append(matches[i].Groups[1 + j * (numberOfInnerCapturingParentheses + 1)]); // append them all together, assuming all exept one are empty
            }
            result[i] = mb.ToString();
        }
        return result;
    }

    static string AdaptBackreferences(string pattern) { return pattern; } // to be written

来源：https://stackoverflow.com/questions/3165755/capturing-thispartonly-and-thispartonly-with-the-same-group

标签

java

regex

capturing-group

Capturing <thisPartOnly> and (thisPartOnly) with the same group

问题