Finding all the common substrings of two given strings

臣服心动 2020-12-12 18:22

I have come across a problem statement that asks for all the common substrings of two given strings, in such a way that in every case you have to print the l

2 answers
  •  陌清茗 (original poster)
    2020-12-12 19:13

    You would be better off with a proper algorithm for the task than with a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix trees and dynamic programming.

    The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:

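    import java.util.HashSet;
    import java.util.Set;

    public static Set<String> longestCommonSubstrings(String s, String t) {
        // table[i][j] = length of the common substring ending at s[i] and t[j]
        int[][] table = new int[s.length()][t.length()];
        int longest = 0;
        Set<String> result = new HashSet<>();

        for (int i = 0; i < s.length(); i++) {
            for (int j = 0; j < t.length(); j++) {
                if (s.charAt(i) != t.charAt(j)) {
                    continue;  // no match: table[i][j] keeps its default of 0
                }

                // Extend the run that ended at the diagonal neighbour, if any.
                table[i][j] = (i == 0 || j == 0) ? 1
                                                 : 1 + table[i - 1][j - 1];
                if (table[i][j] > longest) {
                    // New maximum: discard the shorter results found so far.
                    longest = table[i][j];
                    result.clear();
                }
                if (table[i][j] == longest) {
                    result.add(s.substring(i - longest + 1, i + 1));
                }
            }
        }
        return result;
    }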
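If only the longest substrings are needed, the O(n m) space can be reduced, because each row of the table reads nothing but the row above it. The following is a minimal sketch of that idea, keeping two rows instead of the full table; the class name `LongestCommonSubstringsRolling` is made up for illustration. Note that this variant is not suitable for the "all common substrings" extension discussed next, which needs the whole table.

```java
import java.util.HashSet;
import java.util.Set;

class LongestCommonSubstringsRolling {

    // Same recurrence as the full-table version, but only two rows are kept,
    // so the extra space is O(m) where m = t.length().
    static Set<String> longestCommonSubstrings(String s, String t) {
        int[] prev = new int[t.length()];  // row i - 1 (all zeros before row 0)
        int[] curr = new int[t.length()];  // row i
        int longest = 0;
        Set<String> result = new HashSet<>();

        for (int i = 0; i < s.length(); i++) {
            for (int j = 0; j < t.length(); j++) {
                if (s.charAt(i) != t.charAt(j)) {
                    curr[j] = 0;  // the array is reused, so stale values must be reset
                    continue;
                }
                curr[j] = (j == 0) ? 1 : 1 + prev[j - 1];
                if (curr[j] > longest) {
                    longest = curr[j];
                    result.clear();
                }
                if (curr[j] == longest) {
                    result.add(s.substring(i - longest + 1, i + 1));
                }
            }
            // The finished row becomes the "previous" row for the next iteration.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [eatsleep]
        System.out.println(longestCommonSubstrings("eatsleepnightxyz", "eatsleepabcxyz"));
    }
}
```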
    

    Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz and eatsleepabcxyz:

      e a t s l e e p a b c x y z
    e 1 0 0 0 0 1 1 0 0 0 0 0 0 0
    a 0 2 0 0 0 0 0 0 1 0 0 0 0 0
    t 0 0 3 0 0 0 0 0 0 0 0 0 0 0
    s 0 0 0 4 0 0 0 0 0 0 0 0 0 0
    l 0 0 0 0 5 0 0 0 0 0 0 0 0 0
    e 1 0 0 0 0 6 1 0 0 0 0 0 0 0
    e 1 0 0 0 0 1 7 0 0 0 0 0 0 0
    p 0 0 0 0 0 0 0 8 0 0 0 0 0 0
    n 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    i 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    g 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    h 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    t 0 0 1 0 0 0 0 0 0 0 0 0 0 0
    x 0 0 0 0 0 0 0 0 0 0 0 1 0 0
    y 0 0 0 0 0 0 0 0 0 0 0 0 2 0
    z 0 0 0 0 0 0 0 0 0 0 0 0 0 3
    
    • The eatsleep result is obvious: that's the 12345678 diagonal streak at the top-left.
    • The xyz result is the 123 diagonal at the bottom-right.
    • The a result is indicated by the 1 near the top (second row, ninth column).
    • The t result is indicated by the 1 near the bottom left.

    What about the other 1s at the left, the top, and next to the 6 and 7? Those don't count because they lie within the rectangle spanned by the 12345678 diagonal; in other words, they are already covered by eatsleep.
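To inspect the table for other inputs, it can be rendered in the same layout as above. This is only a small sketch; the class and method names (`LcsTablePrinter`, `buildTable`, `render`) are made up for illustration, and the columns stay aligned only while every entry is a single digit.

```java
class LcsTablePrinter {

    // table[i][j] = length of the common substring ending at s[i] and t[j]
    static int[][] buildTable(String s, String t) {
        int[][] table = new int[s.length()][t.length()];
        for (int i = 0; i < s.length(); i++) {
            for (int j = 0; j < t.length(); j++) {
                if (s.charAt(i) == t.charAt(j)) {
                    table[i][j] = (i == 0 || j == 0) ? 1 : 1 + table[i - 1][j - 1];
                }
            }
        }
        return table;
    }

    // One header line with the characters of t, then one line per character of s.
    static String render(String s, String t) {
        int[][] table = buildTable(s, t);
        StringBuilder out = new StringBuilder(" ");
        for (char c : t.toCharArray()) {
            out.append(' ').append(c);
        }
        out.append('\n');
        for (int i = 0; i < s.length(); i++) {
            out.append(s.charAt(i));
            for (int j = 0; j < t.length(); j++) {
                out.append(' ').append(table[i][j]);
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render("eatsleepnightxyz", "eatsleepabcxyz"));
    }
}
```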

    I recommend one pass that does nothing but build the table, followed by a second pass that iterates backwards from the bottom-right to gather the result set.
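The two passes could be sketched as follows. This is one possible reading of the rule above, not a definitive implementation: a cell is reported only if it ends its diagonal run and does not lie inside the rectangle of a run that was already reported. The class name `CommonSubstrings`, the method names, and the `covered` bookkeeping array are assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class CommonSubstrings {

    // Pass 1: table[i][j] = length of the common substring ending at s[i] and t[j].
    static int[][] buildTable(String s, String t) {
        int[][] table = new int[s.length()][t.length()];
        for (int i = 0; i < s.length(); i++) {
            for (int j = 0; j < t.length(); j++) {
                if (s.charAt(i) == t.charAt(j)) {
                    table[i][j] = (i == 0 || j == 0) ? 1 : 1 + table[i - 1][j - 1];
                }
            }
        }
        return table;
    }

    // Pass 2: walk backwards from the bottom-right and collect the runs.
    static Set<String> commonSubstrings(String s, String t) {
        int[][] table = buildTable(s, t);
        // covered[r][c] is true once (r, c) lies in the rectangle of a reported run.
        boolean[][] covered = new boolean[s.length()][t.length()];
        Set<String> result = new LinkedHashSet<>();

        for (int i = s.length() - 1; i >= 0; i--) {
            for (int j = t.length() - 1; j >= 0; j--) {
                int len = table[i][j];
                if (len == 0 || covered[i][j]) {
                    continue;
                }
                // Only the last cell of a diagonal run stands for the whole run.
                boolean runContinues = i + 1 < s.length() && j + 1 < t.length()
                        && table[i + 1][j + 1] > 0;
                if (runContinues) {
                    continue;
                }
                result.add(s.substring(i - len + 1, i + 1));
                // Mark the len x len rectangle spanned by this diagonal.
                for (int r = i - len + 1; r <= i; r++) {
                    for (int c = j - len + 1; c <= j; c++) {
                        covered[r][c] = true;
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [xyz, t, eatsleep, a]
        System.out.println(commonSubstrings("eatsleepnightxyz", "eatsleepabcxyz"));
    }
}
```

Because the scan runs from the bottom-right, the end of a long run is always visited before any cell inside its rectangle, so the shorter matches it covers are skipped. Marking rectangles cell by cell is simple but can do redundant work when rectangles overlap.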
