Replace repeating strings in a string

问题

I'm trying to find (and replace) repeated string in a string.

My string can look like this:

Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

This should become:

Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

Note how the amit sit isn't removed since its not repeated.

Or the string can be like this:

Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.

which should become:

Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

So its not just a-z but can also have other (ascii) chars. I'm verry happy if someone can help me with this.

The next step would be to match (and replace) something like this:

2 questions 3 questions 4 questions 5 questions

which would become:

2 questions

The number in the final output can be any number 2,3,4, it doesn't matter. There will only be different numbers in the final example but the words will be the same.

回答1:

First task solution code:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        // Here I iterate through the number of words that will be repeated and check all the possible positions reps
        for ($i = round($words_count / 2); $i >= 1; $i--) {
            for ($j = 0; $j < ($words_count - $i); $j++) {
                $need_remove_item = !$need_remove[$j];
                for ($k = $j; $k < ($j + $i); $k++) {
                    if ($words[$k] != $words[$k + $i]) {
                        $need_remove_item = false;
                        break;
                    }
                }
                if ($need_remove_item) {
                    for ($k = $j; $k < ($j + $i); $k++) {
                        $need_remove[$k] = true;
                    }
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = 'Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.' . '<br>' . '<br>';



    $string = 'Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.';

?>

Second task solution code:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        for ($j = 0; $j < ($words_count - 1); $j++) {
            $need_remove_item = !$need_remove[$j];
            for ($k = $j + 1; $k < ($words_count - 1); $k += 2) {
                if ($words[$k] != $words[$k + 2]) {
                    $need_remove_item = false;
                    break;
                }
            }
            if ($need_remove_item) {
                for ($k = $j + 2; $k < $words_count; $k++) {
                    $need_remove[$k] = true;
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = '2 questions 3 questions 4 questions 5 questions';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo '2 questions';

?>

回答2:

If it helps, \1, \2, etc. is used to reference previous grouping. so, for example, the following would pick out repeated words and make them repeat only once:

$string =~ s/(\w+) ( \1)+/$1/g

Repeated phrases could be similiarly put.

回答3:

Interesting question. This can be solved with a single preg_replace() statement but the length of the repeated phrase must be limited to avoid excessive backtracking. Here is a solution with a commented regex that works for the test data and fixes doubled, tripled (or repeated n times) phrases having a max length of 50 chars:

Solution to part 1:

$result = preg_replace('/
    # Match a doubled "phrase" having length up to 50 chars.
    (            # $1: Phrase having whitespace boundaries.
      (?<=\s|^)  # Assert phrase preceded by ws or BOL.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $1: Phrase
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \1         # Match duplicate of phrase.
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1', $text);

Note that with this solution, a "phrase" can consist of a single word, and there are legitimate cases where doubled words are valid grammar and should not be fixed. If the above solution is not the desired behavior, the regex can be easily modified to define a "phrase" as being two or more "words".

Edit: Modified above regex to handle any number of phrase repetitions. Also added solution to the second part of the question below.

And here is a similar solution where the phrase begins with a word of digits and the repeating phrases must also begin with a word of digits (but the repeating phrases' first word of digits do not need to match the original):

Solution to part 2:

$result = preg_replace('/
    # Match doubled "phrases" with wildcard digits first word.
    (            # $1: 1st word of phrase (digits).
    \b           # Anchor 1st phrase word to word boundary.
    \d+          # Phrase 1st word is string of digits.
    \s+          # 1st and 2nd words separated by whitespace.
    )            # End $1:  1st word of phrase (digits).
    (            # $2: Part of phrase after 1st digits word.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $2: Part of phrase after 1st digits word.
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \d+        # Match duplicate of phrase.
      \s+        # Doubled phrase separated by whitespace.
      \2         # Match duplicate of phrase.
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1$2', $text);

回答4:

Good old bruteforce...

It's so ugly I inclined to post it as eval(base64_decode(...)), but here it is:

function fixi($str) {
    $a = explode(" ", $str);
    return implode(' ', fix($a));
}

function fix($a) {
    $l = count($a);
    $len = 0;
    for($i=1; $i <= $l/2; $i++) {
        for($j=0; $j <= $l - 2*$i; $j++) {
            $n = 1;
            $found = false;
            while(1) {
                $a1 = array_slice($a, $j, $i);
                $a2 = array_slice($a, $j+$n*$i, $i);
                if ($a1 != $a2)
                    break;
                $found = true;
                $n++;
            }
            if ($found && $n*$i > $len) {
                $len = $n*$i;
                $f_j = $j;
                $f_i = $i;
            }
        }
    }
    if ($len) {
        return array_merge(
            fix(array_slice($a, 0, $f_j)),
            array_slice($a, $f_j, $f_i),
            fix(array_slice($a, $f_j+$len, $l))
        );
    }
    return $a;
}

Punctuation is part of the word, so don't expect miracles.

回答5:

2 questions 3 questions 4 questions 5 questions

becoming

2 questions

Can be solved using:

$string =~ s/(\d+ (.*))( \d+ (\2))+/$1/g;

It matches a number followed by anything (greedily), and then a series of things beginning with a space followed by a number followed by something that matches the previous anything. For all of that, it replaces it with the first number anything pair.

回答6:

((?:\b|^)[\x20-\x7E]+)(\1)+ will match any repeating string of printable ASCII characters that start on a word boundary. Meaning it will match hello hello but not the double l in hello.

If you want to adjust the characters that will match, you can change and add ranges in the form \x##-\x##\x##-\x## (where ## is a hex value) and omit the -\x## where you just want to add one character.

The only problem I can see is that this somewhat simple approach would would pick out a legitimately repeated word rather than a repeated phrase. If you want to force it to only pick off repeated phrases composed of multiple words, you could use something like ((?:\b|^)[\x20-\x7E]+\s)(\1)+ (note the extra \s).

((?:\b|^)[\x20-\x7E]+\s)(.*(\1))+ is getting close to solving your second problem, but I may have thought myself into a corner on that one.

Edit: just to clarify, you'd use $string ~= /((?:\b|^)[\x20-\x7E]+\s)(.*(\1))+/$1/ig in Perl or the PHP equivalent to use that.

回答7:

Thanks a lot all for answering the question. It helped me a lot!. I tried both Ridgerunners and dtanders regular expressions and although they worked (with some modifications) on some test strings I had troubles with other strings.

So I went for the brute force attack :) which is inspired by Nox. This way I could combine both problems and still have a good performance (even better than regexp since that is slow in PHP).

For anyone interested here is the concept code:

function split_repeating_num($string) {
$words = explode(' ', $string);
$all_words = $words;
$num_words = count($words);
$max_length = 100; //max length of substring to check
$max_words = 4; //maximum number of words in substring 
$found = array();
$current_pos = 0;
$unset = array();
foreach ($words as $key=>$word) {
    //see if this word exist in the next part of the string
    $len = strlen($word);
    if ($len === 0) continue;
    $current_pos += $len + 1; //+1 for the space
    $substr = substr($string, $current_pos, $max_length);
    if (($pos = strpos(substr($string, $current_pos, $max_length), $word)) !== false) {
        //found it
        //set pointer words and all_words to same value
        while (key($all_words) < $key ) next($all_words);
        while (key($all_words) > $key ) prev($all_words);
        $next_word = next($all_words);

        while (is_numeric($next_word) || $next_word === '') {
            $next_word = next($all_words);
        }
        // see if it follows the word directly
        if ($word === $next_word) {
            $unset [$key] = 1;
        } elseif ($key + 3 < $num_words) {
            for($i = $max_words; $i > 0; $i --) {
                $x = 0;
                $string_a = '';
                $string_b = '';
                while ($x < $i ) {
                    while (is_numeric($next_word) || $next_word === '' ) {
                        $next_word = each($all_words);
                    }
                    $x ++;
                    $string_a .= $next_word;
                    $string_b .= $words [key($all_words) + $i];
                }

                if ($string_a === $string_b) {
                    //we have a match
                    for($x = $key; $x < $i + $key; $x ++)
                        $unset [$x] = 1;
                }
            }
        }
    }

}
foreach ($unset as $k=>$v) {
    unset($words [$k]);
}
return implode(' ', $words);

}

There are still some minor issues and I do need to test is, but it seems to do its job.

来源：https://stackoverflow.com/questions/6781942/replace-repeating-strings-in-a-string

标签

php

regex

perl

string-search