How to find Longest Common Substring using C++

轮回少年 2020-12-05 07:43

I searched online for a C++ Longest Common Substring implementation but failed to find a decent one. I need an LCS algorithm that returns the substring itself, so it's not j

7 Answers
  • 2020-12-05 08:27

    I tried several different solutions for this, but they all seemed really slow, so I came up with the code below. I didn't test it much, but it seems to run a bit faster for me.

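    #include <iostream>
    #include <string>

    // Brute-force search: for every position in b that matches a[i], grow the
    // candidate one character at a time while it still occurs somewhere in a.
    std::string lcs(const std::string& a, const std::string& b)
    {
        if (a.empty() || b.empty()) return {};

        std::string current_lcs;

        for (std::size_t i = 0; i < a.length(); i++) {
            std::size_t fpos = b.find(a[i], 0);
            while (fpos != std::string::npos) {
                // A single matching character is already a common substring.
                std::string tmp_lcs(1, a[i]);
                if (tmp_lcs.length() > current_lcs.length()) {
                    current_lcs = tmp_lcs;
                }
                for (std::size_t x = fpos + 1; x < b.length(); x++) {
                    tmp_lcs += b[x];
                    if (a.find(tmp_lcs, 0) == std::string::npos) {
                        break;
                    }
                    if (tmp_lcs.length() > current_lcs.length()) {
                        current_lcs = tmp_lcs;
                    }
                }
                fpos = b.find(a[i], fpos + 1);
            }
        }
        return current_lcs;
    }

    int main(int argc, char** argv)
    {
        // Both strings are expected on the command line.
        if (argc < 3) {
            std::cerr << "usage: " << argv[0] << " <string1> <string2>" << std::endl;
            return 1;
        }
        std::cout << lcs(argv[1], argv[2]) << std::endl;
    }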
    
  • There is a very elegant Dynamic Programming solution to this.

    Let LCSuff[i][j] be the length of the longest common suffix of the prefixes X[1..i] and Y[1..j], where X has length m and Y has length n. We have two cases here:

    • X[i] == Y[j]: we can extend the longest common suffix of X[1..i-1] and Y[1..j-1] by one character. Thus LCSuff[i][j] = LCSuff[i-1][j-1] + 1 in this case.

    • X[i] != Y[j]: since the last characters differ, X[1..i] and Y[1..j] can't have a non-empty common suffix. Hence LCSuff[i][j] = 0 in this case.

    Every common substring is a common suffix of some pair of prefixes, so we now take the maximum of these longest common suffixes.

    So the length of LCSubstr(X,Y) = max(LCSuff[i][j]), where 1<=i<=m and 1<=j<=n, and the substring itself is the LCSuff[i][j] characters of X that end at the maximising i.

    The algorithm pretty much writes itself now.

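    #include <string>
    #include <vector>
    using std::string;

    string LCSubstr(const string& x, const string& y){
        int m = x.length(), n = y.length();

        // (m+1) x (n+1) table, zero-initialised: row 0 and column 0 stand for
        // the empty prefix. A vector replaces int LCSuff[m][n], which is a
        // non-standard variable-length array and one row and column too small.
        std::vector<std::vector<int>> LCSuff(m + 1, std::vector<int>(n + 1, 0));

        for(int i=1; i<=m; i++){
            for(int j=1; j<=n; j++){
                if(x[i-1] == y[j-1])
                    LCSuff[i][j] = LCSuff[i-1][j-1] + 1;
                else
                    LCSuff[i][j] = 0;
            }
        }

        // Pick the longest suffix; it ends at x[i-1], so it starts at i - length.
        string longest = "";
        for(int i=1; i<=m; i++){
            for(int j=1; j<=n; j++){
                if(LCSuff[i][j] > static_cast<int>(longest.length()))
                    longest = x.substr(i - LCSuff[i][j], LCSuff[i][j]);
            }
        }
        return longest;
    }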
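
    Each row of the LCSuff table only reads the row directly above it, so the full table is not strictly needed. The following is a minimal sketch under that observation, not a replacement for the answer's code: it keeps two rows and remembers where the best match ends. The name lcsubstr_two_rows and the variables best_len and best_end are made up for this illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Sketch: same LCSuff recurrence, but only the previous row is kept.
// lcsubstr_two_rows is a hypothetical name used for illustration.
std::string lcsubstr_two_rows(const std::string& x, const std::string& y) {
    const std::size_t m = x.size(), n = y.size();
    // prev is row i-1 of the table, cur is row i; index 0 is the empty prefix.
    std::vector<std::size_t> prev(n + 1, 0), cur(n + 1, 0);
    std::size_t best_len = 0;  // length of the best match found so far
    std::size_t best_end = 0;  // one past the last character of that match in x

    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            if (x[i - 1] == y[j - 1]) {
                cur[j] = prev[j - 1] + 1;  // extend the common suffix
                if (cur[j] > best_len) {
                    best_len = cur[j];
                    best_end = i;
                }
            } else {
                cur[j] = 0;  // last characters differ: no common suffix
            }
        }
        std::swap(prev, cur);  // row i becomes the "previous" row
    }
    return x.substr(best_end - best_len, best_len);
}
```

    Calling lcsubstr_two_rows("ABABC", "BABCA") returns "BABC". Memory drops from O(mn) to O(n), while the running time stays O(mn).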
    
  • 2020-12-05 08:33

    Find the largest substring from all strings under consideration. From N strings, you'll have N substrings. Choose the largest of those N.

  • 2020-12-05 08:37

    The answer is a generalised suffix tree: http://en.wikipedia.org/wiki/Generalised_suffix_tree

    You can build a generalised suffix tree from multiple strings. The longest common substring is then spelled out by the deepest internal node whose subtree contains suffixes from every string.

    See also http://en.wikipedia.org/wiki/Longest_common_substring_problem

    The suffix tree can be built in O(n) time for each string of length n, so k*O(n) in total, where k is the number of strings.

    So this problem can be solved very quickly.
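
    A full generalised suffix tree is too long to write out here. As a rough illustration of the same linear-time idea for two strings, the sketch below swaps in a suffix automaton, which is a related structure and not the suffix tree this answer names. It builds the automaton of the first string and then walks the second string through it. The names State and lcs_suffix_automaton are made up for this example, and std::map is used for transitions, which adds a logarithmic factor in the alphabet size.

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch only: a suffix automaton of `a`, used in place of a suffix tree.
struct State {
    int len = 0;               // length of the longest substring in this state
    int link = -1;             // suffix link (-1 only for the initial state)
    std::map<char, int> next;  // outgoing transitions
};

std::string lcs_suffix_automaton(const std::string& a, const std::string& b) {
    std::vector<State> st(1);  // state 0 is the initial state
    int last = 0;
    for (char c : a) {  // online construction, one character at a time
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                // Split q: the clone keeps q's transitions and link.
                State cloned = st[q];
                cloned.len = st[p].len + 1;
                int clone = static_cast<int>(st.size());
                st.push_back(cloned);
                while (p != -1 && st[p].next.count(c) && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }

    // Walk b through the automaton; l is the length of the longest suffix of
    // b[0..i] that is also a substring of a.
    int v = 0, l = 0, best = 0, best_end = 0;
    for (int i = 0; i < static_cast<int>(b.size()); ++i) {
        char c = b[i];
        while (v != 0 && !st[v].next.count(c)) {
            v = st[v].link;  // drop characters from the front of the match
            l = st[v].len;
        }
        auto it = st[v].next.find(c);
        if (it != st[v].next.end()) {
            v = it->second;
            ++l;
        }
        if (l > best) {
            best = l;
            best_end = i + 1;
        }
    }
    return b.substr(best_end - best, best);
}
```

    For more than two strings the same walk would have to be repeated per string while tracking a minimum per state, which this sketch does not attempt.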

  • 2020-12-05 08:37

    This is a dynamic programming problem and can be solved in O(mn) time, where m is the length of one string and n is the length of the other.

    As with any problem solved using dynamic programming, we divide the problem into subproblems. Say the two strings are x1x2x3...xm and y1y2y3...yn.

    If S(i,j) is the length of the longest common substring of x1x2x3...xi and y1y2y3...yj, then

    S(i,j) = max { length of the longest common substring ending at xi/yj (0 if x[i] != y[j]), S(i-1, j-1), S(i, j-1), S(i-1, j) }

    Here is a working program in Java. I am sure you can convert it to C++:

    public class LongestCommonSubstring {

        public static void main(String[] args) {
            String str1 = "abcdefgijkl";
            String str2 = "mnopabgijkw";
            System.out.println(getLongestCommonSubstring(str1,str2));
        }

        public static String getLongestCommonSubstring(String str1, String str2) {
            // longest[i+1][j+1] holds S(i, j): the length of the longest common substring
            // of str2[0..i] and str1[0..j]. Java zero-initialises the array, so row 0 and
            // column 0 (one string empty) already hold 0.
            int[][] longest = new int[str2.length() + 1][str1.length() + 1];
            // Start index in str1 and length of the best match found so far
            int best_start = 0, best_length = 0;

            for(int i = 0; i <  str2.length(); i++) {
                for(int j = 0; j < str1.length(); j++) {

                    // Length of the common substring ending at i/j: walk backwards while
                    // the characters keep matching. This walk makes the program
                    // O(m*n*min(m,n)) in the worst case as written.
                    int length = 0;
                    while(length <= i && length <= j &&
                            str2.charAt(i - length) == str1.charAt(j - length)) {
                        length++;
                    }

                    // Find the longest between S(i-1,j-1), S(i,j-1), S(i-1,j)
                    int tmp_max_length = Math.max(longest[i][j], Math.max(longest[i+1][j], longest[i][j+1]));
                    longest[i+1][j+1] = Math.max(length, tmp_max_length);

                    // Compare against the global best, not only the neighbouring cells;
                    // otherwise a shorter match found later can overwrite the indices.
                    if(length > best_length) {
                        best_length = length;
                        best_start = j - length + 1;
                    }
                }
            }

            // substring's end index is exclusive, so no special case is needed at the end of str1
            return str1.substring(best_start, best_start + best_length);
        }
    }
    
  • 2020-12-05 08:38

    Here is a C# version that finds the longest common substrings of two strings using dynamic programming (for more details, see http://codingworkout.blogspot.com/2014/07/longest-common-substring.html):

    // Requires: using System; using System.Collections.Generic;
    // Assert comes from the unit-test framework (the tests use [TestMethod]).
    class LCSubstring
    {
        public int Length = 0;
        public List<Tuple<int, int>> indices = new List<Tuple<int, int>>();
    }
    public string[] LongestCommonSubStrings(string A, string B)
    {
        int[][] DP_LCSuffix_Cache = new int[A.Length+1][];
        for (int i = 0; i <= A.Length; i++)
        {
            DP_LCSuffix_Cache[i] = new int[B.Length + 1];
        }
        LCSubstring lcsSubstring = new LCSubstring();
        for (int i = 1; i <= A.Length; i++)
        {
            for (int j = 1; j <= B.Length; j++)
            {
                //LCSuffix(Xi, Yj) = 0 if X[i] != Y[j]
                //                 = LCSuffix(Xi-1, Yj-1) + 1 if X[i] == Y[j]
                if (A[i - 1] == B[j - 1])
                {
                    int lcSuffix = 1 + DP_LCSuffix_Cache[i - 1][j - 1];
                    DP_LCSuffix_Cache[i][j] = lcSuffix;
                    if (lcSuffix > lcsSubstring.Length)
                    {
                        lcsSubstring.Length = lcSuffix;
                        lcsSubstring.indices.Clear();
                        var t = new Tuple<int, int>(i, j);
                        lcsSubstring.indices.Add(t);
                    }
                    else if(lcSuffix == lcsSubstring.Length)
                    {
                        //there may be more than one longest common substring
                        lcsSubstring.indices.Add(new Tuple<int, int>(i, j));
                    }
                }
                else
                {
                    DP_LCSuffix_Cache[i][j] = 0;
                }
            }
        }
        if(lcsSubstring.Length > 0)
        {
            List<string> substrings = new List<string>();
            foreach(Tuple<int, int> indices in lcsSubstring.indices)
            {
                //each pair marks where a longest match ends; step back to its start
                string s = string.Empty;
                int i = indices.Item1 - lcsSubstring.Length;
                int j = indices.Item2 - lcsSubstring.Length;
                Assert.IsTrue(DP_LCSuffix_Cache[i][j] == 0);
                for(int l =0; l<lcsSubstring.Length;l++)
                {
                    s += A[i];
                    Assert.IsTrue(A[i] == B[j]);
                    i++;
                    j++;
                }
                Assert.IsTrue(i == indices.Item1);
                Assert.IsTrue(j == indices.Item2);
                Assert.IsTrue(DP_LCSuffix_Cache[i][j] == lcsSubstring.Length);
                substrings.Add(s);
            }
            return substrings.ToArray();
        }
        return new string[0];
    }
    

    The unit tests are:

    [TestMethod]
    public void LCSubstringTests()
    {
        string A = "ABABC", B = "BABCA";
        string[] substrings = this.LongestCommonSubStrings(A, B);
        Assert.IsTrue(substrings.Length == 1);
        Assert.IsTrue(substrings[0] == "BABC");
        A = "ABCXYZ"; B = "XYZABC";
        substrings = this.LongestCommonSubStrings(A, B);
        Assert.IsTrue(substrings.Length == 2);
        Assert.IsTrue(substrings.Any(s => s == "ABC"));
        Assert.IsTrue(substrings.Any(s => s == "XYZ"));
        A = "ABC"; B = "UVWXYZ";
        string substring = "";
        for(int i =1;i<=10;i++)
        {
            A += i;
            B += i;
            substring += i;
            substrings = this.LongestCommonSubStrings(A, B);
            Assert.IsTrue(substrings.Length == 1);
            Assert.IsTrue(substrings[0] == substring);
        }
    }
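
    Since the question asks for C++, here is a rough sketch of the same idea (return every longest common substring, not just one) in that language. It is not a line-by-line port of the C# above: the function name all_longest_common_substrings is made up for this example, and it collapses repeated results with std::set, whereas the C# version adds one entry per index pair.

```cpp
#include <set>
#include <string>
#include <vector>

// Sketch: collect every distinct longest common substring of a and b.
// all_longest_common_substrings is a hypothetical name for illustration.
std::vector<std::string> all_longest_common_substrings(const std::string& a,
                                                       const std::string& b) {
    // suffix[i][j] = length of the longest common suffix of a[0..i) and b[0..j)
    std::vector<std::vector<std::size_t>> suffix(
        a.size() + 1, std::vector<std::size_t>(b.size() + 1, 0));
    std::size_t best = 0;         // longest length seen so far
    std::set<std::string> found;  // distinct substrings of that length

    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            if (a[i - 1] != b[j - 1]) {
                continue;  // suffix[i][j] stays 0
            }
            suffix[i][j] = suffix[i - 1][j - 1] + 1;
            if (suffix[i][j] > best) {
                best = suffix[i][j];
                found.clear();  // shorter candidates are no longer longest
            }
            if (suffix[i][j] == best) {
                found.insert(a.substr(i - best, best));
            }
        }
    }
    // std::set iterates in sorted order, so the result is sorted.
    return std::vector<std::string>(found.begin(), found.end());
}
```

    With the inputs from the unit tests above, "ABABC" and "BABCA" give the single result "BABC", and "ABCXYZ" and "XYZABC" give "ABC" and "XYZ".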
    