I have a regex that can match a string in multiple overlapping ways. However, it seems to capture only one possible match in the string. How can I get all possible
No problem:
>>> import re
>>> regex = r"([^-]*-)(?=([^-]*))"
>>> for result in re.finditer(regex, "foo-foobar-foobaz"):
...     print("".join(result.groups()))
...
foo-foobar
foobar-foobaz
By putting the second capturing parenthesis in a lookahead assertion, you can capture its contents without consuming it in the overall match.
I've also used [^-]* instead of .* because the dot also matches the separator -, which you probably don't want.
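The same lookahead trick generalizes: wrapping a whole pattern in a capturing lookahead makes every match zero-width, so the engine retries at each position and overlapping hits are reported. This is a minimal sketch; the helper name find_overlapping is made up for illustration and is not part of the re module.

```python
import re

def find_overlapping(pattern, text):
    """Return every (possibly overlapping) match of `pattern` in `text`.

    Hypothetical helper: the pattern is wrapped in a capturing lookahead,
    so nothing is consumed and the scan advances one character at a time.
    """
    wrapped = re.compile("(?=(" + pattern + "))")
    # group(1) is the outer group added above, even if `pattern` has its own groups.
    return [m.group(1) for m in wrapped.finditer(text)]

print(find_overlapping("aba", "ababa"))
print(find_overlapping(r"\d\d", "1234"))
```

Note that this reports at most one match per starting position (the one the engine prefers), not every possible length at that position.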
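If you want to detect overlapping matches, you'll have to implement it yourself. Essentially, for a string foo: find the first match, note the index i at which it starts, then run the search again against foo[i+1:], and repeat until nothing matches.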
It gets trickier if you're using arbitrary-length capture groups (e.g. (.*)) because you probably don't want both foo-foobar and oo-foobar as matches, so you'd have to do some extra analysis to move i even farther than just +1 each match; you'd need to move it the entire length of the first captured group's value, plus one.
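The restart loop described above might look like the following in Python. It is only a sketch under a few assumptions: the pattern's first capture group is the part before a one-character separator, the function name overlapping_matches is invented here, and the search is resumed with the pos argument of Pattern.search rather than by slicing the string.

```python
import re

def overlapping_matches(pattern, text):
    """Collect matches, restarting just past the first group of each match.

    Assumes group 1 is followed by a single separator character, so the
    next search starts at: match start + len(group 1) + 1.
    """
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        match = compiled.search(text, pos)
        if match is None:
            break
        results.append(match.group(0))
        # Skip the whole first group plus the separator, not just one character,
        # so that "oo-foobar" is not reported after "foo-foobar".
        pos = match.start() + len(match.group(1)) + 1
    return results

print(overlapping_matches(r"([^-]*)-([^-]*)", "foo-foobar-foobaz"))
```

Because pos always advances by at least one character, the loop terminates even when the first group is empty.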
It's not something regex engines tend to be able to do. I don't know if Python can. Perl can using the following:
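# The (?{ ... }) block records the current captures, then (*FAIL) forces
# backtracking so the engine goes on to try every other way of matching.
local our @matches;
"foo-foobar-foobaz" =~ /
   ^(.*)-(.*)\z
   (?{ push @matches, [ $1, $2 ] })
   (*FAIL)
/xs;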
This specific problem can probably be solved with the regex engine of many languages, using the following technique:
my @matches;
while ("foo-foobar-foobaz" =~ /(?=-(.*)\z)/gsp) {
   push @matches, [ ${^PREMATCH}, $1 ];
}
(${^PREMATCH} refers to what comes before where the regex matched, and $1 refers to what the first () matched.)
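A rough Python counterpart of this lookahead technique is sketched below. Python's re has no ${^PREMATCH}, so slicing the input up to the match start stands in for it; the function name split_at_each_dash is hypothetical.

```python
import re

def split_at_each_dash(text):
    """Return a (before, after) pair for every '-' in `text`.

    The lookahead matches at each '-' without consuming it; group 1 is
    everything after that dash, and text[:m.start()] plays the role of
    Perl's ${^PREMATCH}.
    """
    # re.DOTALL mirrors Perl's /s flag; \Z anchors at the very end of the string.
    pattern = re.compile(r"(?=-(.*)\Z)", re.DOTALL)
    return [(text[:m.start()], m.group(1)) for m in pattern.finditer(text)]

for before, after in split_at_each_dash("foo-foobar-foobaz"):
    print(before, after)
```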
But you can easily solve this specific problem outside the regex engine:
my @parts = split(/-/, "foo-foobar-foobaz");
my @matches;
for (1..$#parts) {
   push @matches, [
      join('-', @parts[0..$_-1]),
      join('-', @parts[$_..$#parts]),
   ];
}
Sorry for using Perl syntax, but you should be able to get the idea. Translations to Python welcome.
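One possible Python translation of the split-based approach, offered as a sketch rather than the only way to write it (the function name split_pairs and the sep parameter are made up for this example):

```python
def split_pairs(text, sep="-"):
    """Return every way to cut `text` in two at an occurrence of `sep`."""
    parts = text.split(sep)
    # For each cut point i, rejoin the pieces on either side of it.
    return [
        (sep.join(parts[:i]), sep.join(parts[i:]))
        for i in range(1, len(parts))
    ]

print(split_pairs("foo-foobar-foobaz"))
```

No regex is involved, which tends to make this version the easiest to read when the separator is a fixed string.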