Get all possible matches for regex (in python)?

难免孤独 2020-12-18 16:17

I have a regex that can match a string in multiple, overlapping ways. However, it seems to capture only one possible match in the string. How can I get all possible matches?

3 answers
  • 2020-12-18 16:31

    No problem:

    >>> import re
    >>> regex = r"([^-]*-)(?=([^-]*))"
    >>> for result in re.finditer(regex, "foo-foobar-foobaz"):
    ...     print("".join(result.groups()))
    ...
    foo-foobar
    foobar-foobaz
    

    By putting the second capturing parenthesis in a lookahead assertion, you can capture its contents without consuming it in the overall match.

    I've also used [^-]* instead of .* because the dot also matches the separator - which you probably don't want.
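    The same lookahead idea generalizes: if the entire pattern sits inside a lookahead, every overall match is empty, so the scan can begin a new match at each position. A minimal sketch, assuming a hypothetical helper named find_overlapping (it is not part of the re module) and a pattern that contains no capturing groups of its own:

```python
import re

def find_overlapping(pattern, text):
    """Return the substrings matched by pattern, including overlapping ones."""
    # Wrap the pattern in a capturing group inside a lookahead. The overall
    # match is zero-width, so finditer retries from the next position.
    wrapped = "(?=(" + pattern + "))"
    return [m.group(1) for m in re.finditer(wrapped, text)]

print(re.findall("aba", "ababa"))        # ['aba']
print(find_overlapping("aba", "ababa"))  # ['aba', 'aba']
```

    Note that this reports at most one match per starting position, namely the one the engine finds first there.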

  • 2020-12-18 16:31

    If you want to detect overlapping matches, you'll have to implement it yourself. Essentially, for a string foo:

    1. Find the first match, and let i be the string index where it starts.
    2. Run the matching function again against foo[i+1:].
    3. Repeat steps 1 and 2 on the progressively shorter remaining portion of the string.

    It gets trickier if you're using arbitrary-length capture groups (e.g. (.*)), because you probably don't want both foo-foobar and oo-foobar as matches. In that case, advancing i by just one per match is not enough: you'd need to advance it by the entire length of the first captured group's value, plus one.
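    The procedure can be sketched in Python roughly as follows. The function name overlapping_matches and its skip_group parameter are made up for illustration, and the sketch passes a start position to search() rather than slicing the string, which is an assumption about how one might implement step 2:

```python
import re

def overlapping_matches(pattern, text, skip_group=None):
    """Collect matches by repeatedly restarting the search.

    skip_group=None: restart one character after each match's start.
    skip_group=n:    restart one character after the end of group n.
    """
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        match = compiled.search(text, pos)
        if match is None:
            break
        results.append(match.group(0))
        if skip_group is None:
            pos = match.start() + 1
        else:
            # max() guards against a group that did not participate (end == -1).
            pos = max(match.end(skip_group), match.start()) + 1
    return results

pattern = r"([^-]+)-([^-]+)"
naive = overlapping_matches(pattern, "foo-foobar-foobaz")
print(naive[:2])   # ['foo-foobar', 'oo-foobar']
print(overlapping_matches(pattern, "foo-foobar-foobaz", skip_group=1))
# ['foo-foobar', 'foobar-foobaz']
```

    The first call shows the unwanted oo-foobar style results; skipping past the first group removes them.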

  • 2020-12-18 16:42

    It's not something regex engines tend to be able to do. I don't know if Python can. Perl can, using the following:

    local our @matches;
    "foo-foobar-foobaz" =~ /
        ^(.*)-(.*)\z
        (?{ push @matches, [ $1, $2 ] })
        (*FAIL)
    /xs;
    

    This specific problem can probably be solved with the regex engine of many languages, using the following technique:

    my @matches;
    while ("foo-foobar-foobaz" =~ /(?=-(.*)\z)/gsp) {
       push @matches, [ ${^PREMATCH}, $1 ];
    }
    

    (${^PREMATCH} refers to what comes before where the regex matched, and $1 refers to what the first () matched.)
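    A possible Python rendering of that loop, offered as a sketch rather than an exact port: Python's re has no ${^PREMATCH}, so the text before the match is taken by slicing up to match.start(), and \Z stands in for Perl's \z. The function name split_at_each_separator is hypothetical.

```python
import re

def split_at_each_separator(text):
    """Return (before, after) pairs, one for every '-' in text."""
    pairs = []
    # The lookahead matches the empty string just before each '-', and
    # group 1 captures everything after that '-' up to the end of the string.
    for match in re.finditer(r"(?=-(.*)\Z)", text, re.DOTALL):
        before = text[:match.start()]   # plays the role of ${^PREMATCH}
        pairs.append((before, match.group(1)))
    return pairs

print(split_at_each_separator("foo-foobar-foobaz"))
# [('foo', 'foobar-foobaz'), ('foo-foobar', 'foobaz')]
```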

    But you can easily solve this specific problem outside the regex engine:

    my @parts = split(/-/, "foo-foobar-foobaz");
    my @matches;
    for (1..$#parts) {
       push @matches, [
          join('-', @parts[0..$_-1]),
          join('-', @parts[$_..$#parts]),
       ];
    }
    

    Sorry for using Perl syntax, but you should be able to get the idea. Translations to Python welcome.
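    One way the split-and-join version might look in Python, as a rough translation. The name all_splits and the sep parameter are illustrative, not taken from the Perl code:

```python
def all_splits(text, sep="-"):
    """Return every way to split text into two parts at one separator."""
    parts = text.split(sep)
    # For each boundary i, rejoin the pieces to its left and to its right.
    return [
        (sep.join(parts[:i]), sep.join(parts[i:]))
        for i in range(1, len(parts))
    ]

print(all_splits("foo-foobar-foobaz"))
# [('foo', 'foobar-foobaz'), ('foo-foobar', 'foobaz')]
```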
