Find duplicate content in string?

问题

How would you solve the following problem:

I have a semi-large file with text (about 10 pages) and I want to find duplicate content in this text. To be more specific, given a string, find the two longest strings that are identical.

I've been looking at the Longest common subsequence:

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_subsequence

But these implementations take two strings as input.

Maybe there's a service already doing this?

回答1:

Here is a simple (but inefficient) algorithm: Loop all possible substring lengths, from the maximum down to 1. For each length, put all substrings of that length into a dictionary. If you find a duplicate, stop. It must be the largest one. Here is the corresponding C# code:

    public static string FindDuplicateSubstring(string s)
    {
        for (int len = s.Length-1; len > 0; len--) {
            var dict = new Dictionary<string, int>();
            for (int i = 0; i <= s.Length - len; i++) {
                string sub = s.Substring(i, len);
                if (dict.ContainsKey(sub)) return sub;
                else dict[sub] = i;
            }
        }
        return null;
    }

For example, when applied to the text of your question, the longest repeated substring is "implementation". Note that overlapping substrings are allowed, i.e. the input "bbbb" returns "bbb". It wasn't clear from your question if you wanted to exclude the overlapping case. For a faster approach, see my other answer.

回答2:

The 'Longest common subsequence' algorithm does not require the matches to be contiguous substrings. For example, for the strings "computer" and "houseboat" the subsequence is "out". Is this what you want?

If you want the matches to be contiguous substrings, then this is called the longest repeated substring problem. The link describes a linear time and space algorithm using suffix trees.

If you want something short and simple, here is an approach based on the LCS algorithm, but without a table. The idea is to loop all possible integer shifts between the desired substring and its duplicate. For each shift, find the largest contiguous match, by scanning the string once. If the input string has length n, there are O(n) possible shifts and checking each shift takes O(n) time, so the total cost is O(n^2), with only a constant amount of space. (Compare to my simple dictionary answer which takes O(n^3) time and O(n^2) space.) If you don't want overlapping matches (i.e. you want "bbbb" to return "bb" not "bbb"), then when checking each shift, you stop if the largest match exceeds the shift. Here is the C# code:

    public static string FindDuplicateSubstring(string s, 
                                                bool allowOverlap = false)
    {
        int matchPos = 0, maxLength = 0;
        for (int shift = 1; shift < s.Length; shift++) {
            int matchCount = 0;
            for (int i = 0; i < s.Length - shift; i++) {
                if (s[i] == s[i+shift]) {
                    matchCount++;
                    if (matchCount > maxLength) {
                        maxLength = matchCount;
                        matchPos = i-matchCount+1;
                    }
                    if (!allowOverlap && (matchCount == shift)) {
                        // we have found the largest allowable match 
                        // for this shift.
                        break;
                    }
                } else matchCount = 0;
            }
        }
        if (maxLength > 0) return s.Substring(matchPos, maxLength);
        else return null;
    }

I've tested this against my dictionary answer and it gives the same results. But for a random string of length 3000, the dictionary takes 15 seconds, while the above method takes 60ms (and much less memory).

回答3:

try this

public static string FindLargestDuplicateString(
        string text)
    {
        var largest = string.Empty;
        for (var i = '!'; i <= '}'; i++)
        {
            var l = FindLargestDuplicateStringImpl(
                text, i.ToString());
            if (l.Length > largest.Length)
                largest = l;
        }
        return largest;
    }

    private static string FindLargestDuplicateStringImpl(
        string text, string currentLargest)
    {
        bool found = false;
        for (var i = '!'; i <= '}'; i++)
        {
            var comp = currentLargest + i;
            var last = text.LastIndexOf(comp);
            var first = text.IndexOf(comp, 0);
            if (first == -1 || last == -1 || first == last) 
                continue;
            currentLargest = comp;
            found = true;
        }
        return !found ? currentLargest : 
            FindLargestDuplicateStringImpl(text, currentLargest);
    }

回答4:

You can do something like this

public static ArrayList<String> split(String line){
    line+=" ";
    Pattern pattern = Pattern.compile("\\w*\\s");
    Matcher matcher = pattern.matcher(line);
    ArrayList<String> list = new ArrayList<String>();
    while (matcher.find()){
        list.add(matcher.group());
    }
    return list;
}

be sure to remove any punctuation

public static void duplicatedWords(String s, int n){
    ArrayList<String> splitted = split(s); 
    System.out.println(splitted);
    HashMap<String, Integer> map = new HashMap<String, Integer>();
    PriorityQueue<String> pq = new PriorityQueue<String>(splitted.size(), new myComp());
    for(int i = 0; i<splitted.size(); i++){
        if(map.get(splitted.get(i)) == null){
            map.put(splitted.get(i), 1);
        }
        else if(map.get(splitted.get(i)) == 1) {
            map.put(splitted.get(i), map.get(splitted.get(i))+1);
            pq.add(splitted.get(i));
        }
    }
    int size = pq.size();
    for(int i = 0; i<size; i++){
        if(i <n)
            System.out.println(pq.remove());
        else 
            break;
    }
}

With this comparator:

public static class myComp implements Comparator{
    @Override
    public int compare(Object arg0, Object arg1) {
        String s1 = (String)arg0;
        String s2 = (String)arg1;
        return s2.length()-s1.length();
    }

}

来源：https://stackoverflow.com/questions/4223554/find-duplicate-content-in-string

标签

algorithm

string

seo