Regex to match the longest repeating substring

前端未结

关注

 5  720

I\'m writing regular expression for checking if there is a substring, that contains at least 2 repeats of some pattern next to each other. I\'m matching the result of regex

相关标签:

5条回答

既然无缘

2020-12-09 18:46

Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.

It's pretty rudimentary code, but hopefully you'll get the gist of it.

use v5.10;
use strict;
use warnings;

while (<DATA>) {
    chomp;
    print "$_ : ";
    my $longest = foo($_);
    if ($longest) {
        say $longest;
    } else {
        say "No matches found";
    }
}

sub foo {
    my $num = shift;
    my @hits;
    for my $i (0 .. length($num)) {
        my $part = substr $num, $i;
        push @hits, $part =~ /(.+)(?=\1)/g;
    }
    my $long = shift @hits;
    for (@hits) {
        if (length($long) < length) {
            $long = $_;
        }
    }
    return $long;
}

__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919

0 讨论(0)

深忆病人

2020-12-09 18:53

You can do it in a single regex, you just have to pick the longest match from the list of results manually.

def longestrepeating(strg):
    regex = re.compile(r"(?=(.+)\1)")
    matches = regex.findall(strg)
    if matches:
        return max(matches, key=len)

This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):

>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']

0 讨论(0)

情话喂你

2020-12-09 19:01
Not sure if anyone's thought of this...
```
my $originalstring="pdxabababqababqh1234112341";

my $max=int(length($originalstring)/2);
my @result;
foreach my $n (reverse(1..$max)) {
    @result=$originalstring=~m/(.{$n})\1/g;
    last if @result;
}

print join(",",@result),"\n";
```
The longest doubled match cannot exceed half the length of the original string, so we count down from there.

If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.
```
my $n;
foreach (1..$max) {
    unless (@result=$originalstring=~m/(.{$_,})\1/g) {
        $n=--$_;
        last;
    }
}
@result=$originalstring=~m/(.{$n})\1/g;

print join(",",@result),"\n";
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
傲寒

2020-12-09 19:07
In Perl you can do it with one expression with help of (??{ code }):
```
$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
```
Output:
```
101
```
What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.

To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.

The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).

Thus the full expression would have the form:
```
(?=(.+)\1)(?!.+?(..{N,})\2}))
```
With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).

Usage example:
```
use v5.10;

sub longest_rep{
    $_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}

say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';
```
Output:
```
101
10001
191919
101
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
后悔当初

2020-12-09 19:12
Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.

One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:
```
my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
  if($s =~ m/(.{$i})\1/)
  {
    $one = $1;
    last;
  }
}
# now $one is '101'
```
0 讨论(0)
发布评论:

提交评论
- 加载中...