Regex to match the longest repeating substring

予麋鹿 2020-12-09 18:33

I'm writing a regular expression to check whether a string contains a substring made of at least 2 repeats of some pattern next to each other. I'm matching the result of regex

5 answers
  • 2020-12-09 18:46

    Here's a long-ish script that does what you ask. It scans your input string, drops the first character, then scans the remainder again, and so on until nothing is left. Once all possible matches are found, it returns one of the longest. It can be tweaked to return all of the longest matches instead of just one, but I'll leave that to you.

    It's pretty rudimentary code, but hopefully you'll get the gist of it.

    use v5.10;
    use strict;
    use warnings;
    
    while (<DATA>) {
        chomp;
        print "$_ : ";
        my $longest = foo($_);
        if (defined $longest) {    # a match of "0" is valid but false in Perl
            say $longest;
        } else {
            say "No matches found";
        }
    }
    
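    # Return one of the longest substrings that is immediately followed by a copy of itself.
    sub foo {
        my $num = shift;
        my @hits;
        # Scan every suffix so that matches starting at any offset are found.
        for my $i (0 .. length($num) - 1) {
            my $part = substr $num, $i;
            # (.+)(?=\1) captures text that is directly followed by itself.
            push @hits, $part =~ /(.+)(?=\1)/g;
        }
        my $long = shift @hits;    # undef when nothing matched
        for (@hits) {
            if (length($long) < length($_)) {
                $long = $_;
            }
        }
        return $long;
    }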
    
    __DATA__
    56712453289
    22010110100
    5555555
    1919191919
    191919191919
    2323191919191919
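
    The tweak mentioned above (returning every longest match rather than one) could look roughly like the following. This is a minimal sketch in Python rather than Perl, and the function name all_longest_repeats is made up for illustration. It keeps the same idea of scanning every suffix with (.+)(?=\1), then filters the hits by length.

```python
import re

def all_longest_repeats(text):
    """Hypothetical helper: every distinct longest substring that is
    immediately followed by a copy of itself."""
    hits = []
    # Scan each suffix, as the Perl script does with substr.
    for start in range(len(text)):
        hits.extend(re.findall(r"(.+)(?=\1)", text[start:]))
    if not hits:
        return []
    best = max(len(h) for h in hits)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(h for h in hits if len(h) == best))

print(all_longest_repeats("2323191919191919"))  # ['191919']
print(all_longest_repeats("56712453289"))       # []
```

    An empty list stands in for the "No matches found" case.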
    
  • 2020-12-09 18:53

    You can do it with a single regex; you just have to pick the longest match from the list of results yourself.

    import re

    def longestrepeating(strg):
        regex = re.compile(r"(?=(.+)\1)")
        matches = regex.findall(strg)
        if matches:
            return max(matches, key=len)
        return None
    

    This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):

    >>> longestrepeating("yabyababyab")
    'abyab'
    >>> longestrepeating("10100101")
    '010'
    >>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
                   "191919191919", "2323191919191919"]
    >>> [longestrepeating(s) for s in strings]
    [None, '101', '555', '1919', '191919', '191919']
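
    To see why the lookahead matters, the sketch below compares it with a plain consuming pattern on the first example string. The variable names are only illustrative. A consuming match uses up the text it matched, so a longer repeat that overlaps it is never tried, whereas the zero-width lookahead is attempted at every offset.

```python
import re

text = "yabyababyab"

# Consuming pattern: the match "yabyab" at offset 0 uses up the characters
# where the longer repeat "abyababyab" (offset 1) begins.
consuming = re.findall(r"(.+)\1", text)

# Zero-width lookahead: nothing is consumed, so every offset is tried.
overlapping = re.findall(r"(?=(.+)\1)", text)

print(consuming)                  # ['yab']
print(max(overlapping, key=len))  # abyab
```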
    
  • 2020-12-09 19:01

    Not sure if anyone's thought of this...

    my $originalstring="pdxabababqababqh1234112341";
    
    my $max=int(length($originalstring)/2);
    my @result;
    foreach my $n (reverse(1..$max)) {
        @result=$originalstring=~m/(.{$n})\1/g;
        last if @result;
    }
    
    print join(",",@result),"\n";
    

    The longest doubled match cannot exceed half the length of the original string, so we count down from there.
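
    For comparison, the same count-down idea can be sketched in Python. The function name longest_doubled is an assumption for illustration. As in the Perl version, it returns the non-overlapping doubled substrings of the first length that matches.

```python
import re

def longest_doubled(text):
    """Hypothetical helper: doubled substrings of the greatest length."""
    # A doubled substring needs 2*n characters, so n is at most half the length.
    for n in range(len(text) // 2, 0, -1):
        found = re.findall(r"(.{%d})\1" % n, text)
        if found:
            return found
    return []

print(longest_doubled("pdxabababqababqh1234112341"))  # ['ababq', '12341']
```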

    If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.

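    my $n = $max;    # stays at $max if every length up to $max matches
    foreach my $len (1..$max) {
        # {$len,} means "at least $len characters"
        unless ($originalstring =~ m/(.{$len,})\1/) {
            $n = $len - 1;    # back up to the last length that matched
            last;
        }
    }
    # $n == 0 means nothing in the string is doubled
    @result = $n ? $originalstring =~ m/(.{$n})\1/g : ();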
    
    print join(",",@result),"\n";
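
    A Python sketch of the count-up variant, under the same assumptions as before (the function name is made up). It relies on the fact that if no doubled substring of at least n characters exists, none of any greater length exists either, so the loop can stop at the first failure.

```python
import re

def longest_doubled_counting_up(text):
    """Hypothetical helper: count up until no doubled substring is left."""
    best = 0
    for n in range(1, len(text) // 2 + 1):
        # {n,} means "at least n characters".
        if re.search(r"(.{%d,})\1" % n, text) is None:
            break
        best = n
    if best == 0:
        return []  # nothing in the string is doubled
    return re.findall(r"(.{%d})\1" % best, text)

print(longest_doubled_counting_up("pdxabababqababqh1234112341"))
```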
    
  • 2020-12-09 19:07

    In Perl you can do it with one expression with the help of (??{ code }):

    $_ = '01011010';
    say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
    

    Output:

    101
    

    What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.

    To build the expression for the longer pair, a postponed subexpression construct (??{ code }) is used; it evaluates the code inside every time it is reached and uses the returned string as a pattern.

    The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).

    Thus the full expression would have the form:

    (?=(.+)\1)(?!.+?(..{N,})\2)
    

    With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).
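
    Python's re module has no postponed subexpression, but the same logic can be approximated in two steps. The sketch below is an emulation under that assumption, not a translation of the Perl regex, and the name first_longest is hypothetical. For each candidate pair it fills in N by hand and checks that no strictly longer pair starts further right.

```python
import re

def first_longest(text):
    """Hypothetical helper emulating (?=(.+)\\1)(?!.+?(..{N,})\\2)."""
    for m in re.finditer(r"(?=(.+)\1)", text):
        n = len(m.group(1))
        # Same role as ..{N,}: a repeated unit of at least N+1 characters.
        longer = re.compile(r"(.{%d,})\1" % (n + 1))
        # Start one character later, mirroring the leading .+? in the Perl pattern.
        if longer.search(text, m.start() + 1) is None:
            return m.group(1)
    return None

print(first_longest("01011010"))          # 101
print(first_longest("2323191919191919"))  # 191919
```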


    Usage example:

    use v5.10;
    
    sub longest_rep{
        $_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
    }
    
    say longest_rep '01011010';
    say longest_rep '010110101000110001';
    say longest_rep '2323191919191919';
    say longest_rep '22010110100';
    

    Output:

    101
    10001
    191919
    101
    
  • 2020-12-09 19:12

    Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.
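
    The "first match" behaviour is easy to reproduce. The short Python sketch below (variable names are illustrative) shows a plain search returning the leftmost repeat, while comparing candidates from every position finds the longer one.

```python
import re

text = "01011010"

# re.search stops at the leftmost position where the pattern can match.
first = re.search(r"(.+)\1", text).group(1)

# Collecting one candidate per position and comparing lengths finds the longest.
longest = max(re.findall(r"(?=(.+)\1)", text), key=len)

print(first, longest)  # 01 101
```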

    One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:

    my $s = '01011010';
    my $one = undef;
    for(my $i = int (length($s) / 2); $i > 0; --$i)
    {
      if($s =~ m/(.{$i})\1/)
      {
        $one = $1;
        last;
      }
    }
    # now $one is '101'
    