Longest maximum repeating substring

Submitted by 六眼飞鱼酱① on 2019-12-22 18:26:43

Question


A substring can have length 1, 2, 3, and so on. The problem I was trying to solve involved finding the substring that occurs the maximum number of times, which essentially reduces to finding the character with the maximum frequency. I also found out that the longest repeating substring can be found with a suffix tree in O(n). However, a suffix tree returns the substring with length as the priority. I want the substrings that occur the greatest number of times, and among those I want the longest one. For example:

In the following string: ABCZLMNABCZLMNABC
A suffix tree will return ABCZLMN as the longest repeating substring.
However, what I am looking for is ABC, as it is the longest of all the substrings with frequency = 3.
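
To make the requirement concrete, here is a minimal brute-force sketch that counts every substring (overlapping occurrences included) and then picks the highest frequency first and the greatest length second. The function name `bruteForceMostFrequentLongest` is made up for illustration, and the approach is far too slow for large inputs; it is only meant as a reference to check faster solutions against.

```cpp
#include <map>
#include <string>

// Hypothetical reference implementation (name is an assumption).
// Counts all O(n^2) substrings, so it takes roughly O(n^3) time and a lot of memory.
std::string bruteForceMostFrequentLongest(const std::string& s) {
    std::map<std::string, int> count;
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t len = 1; i + len <= s.size(); ++len)
            ++count[s.substr(i, len)];          // one occurrence starting at i

    std::string best;
    int bestFreq = 0;
    for (const auto& entry : count) {
        const std::string& sub = entry.first;
        const int freq = entry.second;
        // Frequency has priority; length only breaks ties between equal frequencies.
        if (freq > bestFreq || (freq == bestFreq && sub.size() > best.size())) {
            best = sub;
            bestFreq = freq;
        }
    }
    return best;                                // empty string for empty input
}
```

For the string above, this returns ABC: the substrings A, B, C, AB, BC and ABC all occur 3 times, and ABC is the longest of them.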

I tried solving this problem by generating every substring between two indices i and j, and then counting the occurrences of each substring using the Z algorithm, which runs in O(n). However, the total complexity was O(n^3).

My O(n^3) code

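// Assumes: typedef long long ll; <iostream>, <map>, <string>, <vector>; using namespace std;
// Also assumes '#' does not occur in the input.
map<ll,vector<string>> m;       // occurrence count -> substrings with that count
    string s; cin >> s;
    for(ll i=0;i<(ll)s.length();i++){
        string c;               // current substring s[i..i+len]
        for(ll len=0; i+len<(ll)s.length();len++){
            c+=s[i+len];
            string kk=c+"#"+s;  // pattern + separator + text
            ll n=kk.length();
            vector<ll> z(n,0);  // Z-array of kk (sized at run time, zero-initialized)
            ll l=0,r=0;         // current Z-box is [l,r]
            for(ll k=1;k<n;k++){
                if(k>r){
                    l=r=k;
                    while(r<n&&kk[r-l]==kk[r])r++;  // bound is the length of kk, not of c
                    z[k]=r-l;
                    r--;
                }
                else{
                    ll k1=k-l;  // renamed so it no longer shadows the map m
                    if(z[k1]<r-k+1)z[k]=z[k1];
                    else{
                        l=k;
                        while(r<n&&kk[r-l]==kk[r])r++;
                        z[k]=r-l;
                        r--;
                    }
                }
            }
            ll occ=0;
            for(ll p=0;p<n;p++){
                if(z[p]==(ll)c.length())occ++;      // full match of c at this position
            }
            m[occ].push_back(c);
        }
    }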

I have not been able to find a way to make this efficient. Any help would be appreciated. Thank you.


Answer 1:


A single character counts as a substring, so the maximum repeating substring must occur with a frequency equal to that of the most common character in the string.

One implication of that is that each character in the maximum repeating substring can only occur once in that substring, because if it occurred more than once then that character on its own would become the maximum repeating substring. For example, the substring "dad" occurs 5 times in the string "dadxdadydadzdadydad", but the substring "d" occurs 10 times.
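
The counts in that example can be checked with a small helper. This is only a sketch: `countOccurrences` is a hypothetical name, and it counts overlapping matches by restarting the search one position after each hit.

```cpp
#include <string>

// Hypothetical helper (name is an assumption): number of possibly
// overlapping occurrences of pat in text.
int countOccurrences(const std::string& text, const std::string& pat) {
    if (pat.empty()) return 0;                  // avoid matching at every position
    int count = 0;
    for (std::size_t pos = text.find(pat);
         pos != std::string::npos;
         pos = text.find(pat, pos + 1))         // step by 1 so overlaps are counted
        ++count;
    return count;
}
```

Calling `countOccurrences("dadxdadydadzdadydad", "dad")` gives 5, while `countOccurrences("dadxdadydadzdadydad", "d")` gives 10, so "d" beats "dad" on frequency.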

The characters also have to appear in the same order each time (or else the individual characters would have a higher frequency than the substring and be the maximum repeating substring themselves). They also can't appear separately from the substring (or else, yet again, they would become the maximum repeating substring).

Therefore, the maximum repeating substring must be made up of a subset (or all) of the equally most frequently occurring characters.

We can easily figure out which characters these are just by making one pass through the string and counting them. We can also deduce which characters appear in which order, by keeping track of which characters appear before and after each character, storing the character if it is the same every time, and zero otherwise. For example, in the string "abcxabcyabczabcyabc", the character "b" is always preceded by "a" and followed by "c":

string s; cin >> s;
int i, freq[256];
// unsigned char keeps characters above 127 usable as non-negative array indices;
// 0 is the "no consistent neighbour" sentinel, so the input must not contain NUL
unsigned char prev[256], next[256];
for(i = 0; i < 256; i++)
    freq[i] = prev[i] = next[i] = 0;
int maxFreq = 0;
for(i = 0; i < (int)s.length(); i++)
{
    unsigned char c = s[i];
    unsigned char p = (i == 0) ? 0 : s[i-1];
    unsigned char n = (i < (int)s.length() - 1) ? s[i+1] : 0;
    if(freq[c] == 0) // first time to encounter this character
    {
        prev[c] = p;
        next[c] = n;
    }
    else // check if it is always preceded and followed by the same characters:
    {
        if(prev[c] != p)
            prev[c] = 0;
        if(next[c] != n)
            next[c] = 0;
    }
    // increment frequency and track the maximum:
    if(++freq[c] > maxFreq)
        maxFreq = freq[c];
}

if(maxFreq == 0)
    return 0;

Then we can iterate over each character and, for those whose frequency equals the maximum frequency, find the length of the string we can form starting with that character by following the next-character links:

int maxLen = 0;
int startingChar = 0;
for(i = 1; i < 256; i++)
{
    // should have a frequency equal to the max and not be preceded
    // by the same character each time (otherwise it is in the middle of the substring)
    if((freq[i] == maxFreq) && (prev[i] == 0))
    {
        int len = 1, j = i;
        while(next[j] != 0)
        {
            len++;
            j = next[j];
        }
        if(len > maxLen)
        {
            maxLen = len;
            startingChar = i;
        }
    }
}

Once we've found the maximum repeating substring, print it out:

// print out the maximum length string:
int j = startingChar;
while(j != 0)
{
    cout << (char)j;
    j = next[j];
}
cout << endl;
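
The three steps above can be combined into one function, which makes the approach easier to test. The following is a sketch under a few assumptions: the name `longestMostFrequentSubstring` is invented here, the arrays are renamed `prevCh` and `nextCh` to avoid any clash with `std::prev` and `std::next`, the input is treated as single bytes, and it must not contain the NUL character because 0 is used as the sentinel.

```cpp
#include <array>
#include <string>

// Hypothetical wrapper (name is an assumption) around the counting,
// chain-following and output steps. Runs in O(n + 256) time.
std::string longestMostFrequentSubstring(const std::string& s) {
    std::array<int, 256> freq{};                       // zero-initialized
    std::array<unsigned char, 256> prevCh{}, nextCh{}; // 0 = no consistent neighbour
    int maxFreq = 0;

    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        const unsigned char p = (i == 0) ? 0 : static_cast<unsigned char>(s[i - 1]);
        const unsigned char n = (i + 1 < s.size()) ? static_cast<unsigned char>(s[i + 1]) : 0;
        if (freq[c] == 0) {            // first occurrence: remember both neighbours
            prevCh[c] = p;
            nextCh[c] = n;
        } else {                       // later occurrence: keep only if unchanged
            if (prevCh[c] != p) prevCh[c] = 0;
            if (nextCh[c] != n) nextCh[c] = 0;
        }
        if (++freq[c] > maxFreq) maxFreq = freq[c];
    }
    if (maxFreq == 0) return "";       // empty input

    int bestLen = 0, start = 0;
    for (int i = 1; i < 256; ++i) {
        // A chain head has the maximum frequency and no fixed predecessor.
        if (freq[i] == maxFreq && prevCh[i] == 0) {
            int len = 1;
            for (int j = i; nextCh[j] != 0; j = nextCh[j]) ++len;
            if (len > bestLen) { bestLen = len; start = i; }
        }
    }

    std::string result;
    for (int j = start; j != 0; j = nextCh[j]) result += static_cast<char>(j);
    return result;
}
```

With the question's input "ABCZLMNABCZLMNABC" this yields "ABC", and with "dadxdadydadzdadydad" it yields "d". When several chains share the maximum length, this sketch keeps the one whose first character has the smallest code.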

If you don't like iterating over those fixed-size arrays, or you need to support Unicode characters, you can use a map from the character type to a struct containing the character's frequency and its prev and next characters.
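
A possible shape for that map-based variant is sketched below. Everything named here is an assumption made for illustration: the struct `CharInfo`, the function `longestMostFrequentSubstringU32`, and the choice of `std::u32string` so that each element is one code point. As before, the value 0 serves as the sentinel, so the input must not contain U+0000.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical per-character record (names are assumptions).
struct CharInfo {
    int freq = 0;
    char32_t prev = U'\0';   // U'\0' means "no consistent predecessor"
    char32_t next = U'\0';   // U'\0' means "no consistent successor"
};

std::u32string longestMostFrequentSubstringU32(const std::u32string& s) {
    std::unordered_map<char32_t, CharInfo> info;
    int maxFreq = 0;

    for (std::size_t i = 0; i < s.size(); ++i) {
        const char32_t p = (i == 0) ? U'\0' : s[i - 1];
        const char32_t n = (i + 1 < s.size()) ? s[i + 1] : U'\0';
        CharInfo& ci = info[s[i]];     // default-constructed on first access
        if (ci.freq == 0) {
            ci.prev = p;
            ci.next = n;
        } else {
            if (ci.prev != p) ci.prev = U'\0';
            if (ci.next != n) ci.next = U'\0';
        }
        if (++ci.freq > maxFreq) maxFreq = ci.freq;
    }

    std::u32string best;
    for (const auto& entry : info) {
        // Only chain heads: maximum frequency and no fixed predecessor.
        if (entry.second.freq != maxFreq || entry.second.prev != U'\0') continue;
        std::u32string candidate;
        for (char32_t c = entry.first; c != U'\0'; c = info.at(c).next)
            candidate += c;
        if (candidate.size() > best.size()) best = candidate;
    }
    return best;                       // empty for empty input
}
```

Because `std::unordered_map` has no defined iteration order, ties between equally long chains are broken arbitrarily in this sketch; a `std::map` would make the choice deterministic at the cost of O(log k) lookups.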



Source: https://stackoverflow.com/questions/38372159/longest-maximum-repeating-substring
