Finding all the common substrings of two given strings

臣服心动 2020-12-12 18:22

I have come across a problem statement asking me to find all the common substrings of two given strings, in such a way that in every case you have to print the l

2 answers
  • 2020-12-12 18:59

    Typically this type of substring matching is done with the assistance of a separate data structure called a Trie (pronounced try). The specific variant that best suits this problem is a suffix tree. Your first step should be to take your inputs and build a suffix tree. Then you'll need to use the suffix tree to determine the longest common substring, which is a good exercise.
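
As a starting point for that exercise, here is a minimal sketch that uses a naive, uncompressed suffix trie rather than a true suffix tree. It is easier to write but can need O(n^2) nodes, so treat it as an illustration only. The names SuffixTrieSketch, longestMatchFrom, and longestCommonSubstring are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Assumption: a plain suffix trie (one node per character), NOT a compressed
// suffix tree. Worst-case size is quadratic in the length of the indexed text.
class SuffixTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    // Insert every suffix of text, one character per trie level.
    SuffixTrieSketch(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int k = start; k < text.length(); k++) {
                node = node.children.computeIfAbsent(text.charAt(k), c -> new Node());
            }
        }
    }

    // Length of the longest prefix of other.substring(from) that occurs
    // somewhere in the indexed text.
    int longestMatchFrom(String other, int from) {
        Node node = root;
        int length = 0;
        for (int k = from; k < other.length(); k++) {
            node = node.children.get(other.charAt(k));
            if (node == null) {
                break;
            }
            length++;
        }
        return length;
    }

    // Index s, then try every starting position of t and keep the best match.
    // On ties, the match that starts earliest in t wins.
    static String longestCommonSubstring(String s, String t) {
        SuffixTrieSketch trie = new SuffixTrieSketch(s);
        String best = "";
        for (int from = 0; from < t.length(); from++) {
            int len = trie.longestMatchFrom(t, from);
            if (len > best.length()) {
                best = t.substring(from, from + len);
            }
        }
        return best;
    }
}
```

Calling SuffixTrieSketch.longestCommonSubstring("eatsleepnightxyz", "eatsleepabcxyz") returns "eatsleep". A real suffix tree would store the same information in linear space.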

  • 2020-12-12 19:13

    You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-tree and dynamic-programming.

    The dynamic programming solution takes O(nm) time and O(nm) space, where n and m are the lengths of the two strings. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:

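    import java.util.HashSet;
    import java.util.Set;

    // Returns every longest common substring of s and t (there may be ties).
    // Place this method inside any class.
    public static Set<String> longestCommonSubstrings(String s, String t) {
        // table[i][j] = length of the common substring ending at s[i] and t[j]
        int[][] table = new int[s.length()][t.length()];
        int longest = 0;
        Set<String> result = new HashSet<>();

        for (int i = 0; i < s.length(); i++) {
            for (int j = 0; j < t.length(); j++) {
                if (s.charAt(i) != t.charAt(j)) {
                    continue;  // mismatch: table[i][j] stays 0
                }

                // Extend the streak from the upper-left neighbour, if there is one.
                table[i][j] = (i == 0 || j == 0) ? 1
                                                 : 1 + table[i - 1][j - 1];
                if (table[i][j] > longest) {
                    // New maximum: discard the shorter results found so far.
                    longest = table[i][j];
                    result.clear();
                }
                if (table[i][j] == longest) {
                    result.add(s.substring(i - longest + 1, i + 1));
                }
            }
        }
        return result;
    }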
    

    Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz (rows) and eatsleepabcxyz (columns):

      e a t s l e e p a b c x y z
    e 1 0 0 0 0 1 1 0 0 0 0 0 0 0
    a 0 2 0 0 0 0 0 0 1 0 0 0 0 0
    t 0 0 3 0 0 0 0 0 0 0 0 0 0 0
    s 0 0 0 4 0 0 0 0 0 0 0 0 0 0
    l 0 0 0 0 5 0 0 0 0 0 0 0 0 0
    e 1 0 0 0 0 6 1 0 0 0 0 0 0 0
    e 1 0 0 0 0 1 7 0 0 0 0 0 0 0
    p 0 0 0 0 0 0 0 8 0 0 0 0 0 0
    n 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    i 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    g 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    h 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    t 0 0 1 0 0 0 0 0 0 0 0 0 0 0
    x 0 0 0 0 0 0 0 0 0 0 0 1 0 0
    y 0 0 0 0 0 0 0 0 0 0 0 0 2 0
    z 0 0 0 0 0 0 0 0 0 0 0 0 0 3
    
    • The eatsleep result is obvious: that's the 12345678 diagonal streak at the top-left.
    • The xyz result is the 123 diagonal at the bottom-right.
    • The a result is indicated by the 1 near the top (second row, ninth column).
    • The t result is indicated by the 1 near the bottom left.

    What about the other 1s at the left, the top, and next to the 6 and 7? Those don't count because they lie inside the rectangle spanned by the 12345678 diagonal; in other words, they are already covered by eatsleep.

    I recommend one pass that does nothing but build the table, then a second pass, iterating backwards from the bottom-right, to gather the result set.
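
The two passes could be sketched roughly as follows. This is one possible reading of the idea, not a definitive implementation: it assumes a cell is "covered" when it lies inside the rectangle spanned by a diagonal streak that has already been reported. The class and method names (AllCommonSubstrings, buildTable, collect) are made up for this example, and marking rectangles cell by cell is simple rather than fast.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class AllCommonSubstrings {

    // Pass 1: table[i][j] = length of the common streak ending at s[i] and t[j].
    static int[][] buildTable(String s, String t) {
        int[][] table = new int[s.length()][t.length()];
        for (int i = 0; i < s.length(); i++) {
            for (int j = 0; j < t.length(); j++) {
                if (s.charAt(i) == t.charAt(j)) {
                    table[i][j] = (i == 0 || j == 0) ? 1 : 1 + table[i - 1][j - 1];
                }
            }
        }
        return table;
    }

    // Pass 2: scan backwards from the bottom-right. Each non-zero cell that is
    // not yet covered is treated as the end of a streak: report its substring,
    // then mark the rectangle spanned by that diagonal as covered.
    static Set<String> collect(String s, String t) {
        int[][] table = buildTable(s, t);
        boolean[][] covered = new boolean[s.length()][t.length()];
        Set<String> result = new LinkedHashSet<>();  // keeps discovery order

        for (int i = s.length() - 1; i >= 0; i--) {
            for (int j = t.length() - 1; j >= 0; j--) {
                int len = table[i][j];
                if (len == 0 || covered[i][j]) {
                    continue;
                }
                result.add(s.substring(i - len + 1, i + 1));
                for (int r = i - len + 1; r <= i; r++) {
                    for (int c = j - len + 1; c <= j; c++) {
                        covered[r][c] = true;
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [xyz, t, eatsleep, a]
        System.out.println(collect("eatsleepnightxyz", "eatsleepabcxyz"));
    }
}
```

Scanning in reverse row-major order means the end of a diagonal is always visited before any other cell inside its rectangle, which is why the stray 1s next to the 6 and 7 are skipped for the example inputs.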
