Efficient Algorithm for String Concatenation with Overlap

轻奢々 2020-12-02 21:32

We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,

12 Answers
  •  情书的邮戳
    2020-12-02 21:58

    Most of the other answers have focused on constant-factor optimizations, but it's also possible to do asymptotically better. Look at your algorithm: it's O(N^2). This seems like a problem that can be solved much faster than that!

    Consider Knuth-Morris-Pratt (KMP). While scanning the text, it keeps track of how much of the pattern it has matched so far. If we search for s2 inside s1, then when the scan reaches the end of s1 it knows how long a prefix of s2 matches the end of s1, and that's the value we're looking for! Just modify the algorithm to continue instead of returning when it matches the substring early on, and have it return the amount matched instead of 0 at the end.

    That gives you an O(n) algorithm. Nice!

        int OverlappedStringLength(string s1, string s2) {
            //Trim s1 so it isn't longer than s2
            if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);
    
            int[] T = ComputeBackTrackTable(s2); //O(n)
    
            int m = 0;
            int i = 0;
            while (m + i < s1.Length) {
                if (s2[i] == s1[m + i]) {
                    i += 1;
                    //<-- removed the return case here, because |s1| <= |s2|
                } else {
                    m += i - T[i];
                    if (i > 0) i = T[i];
                }
            }
    
            return i; //<-- changed the return here to return characters matched
        }
    
        int[] ComputeBackTrackTable(string s) {
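            var T = new int[s.Length];
            if (s.Length == 0) return T; //empty string: no table entries to fill
            int cnd = 0;
            T[0] = -1;
            if (s.Length == 1) return T; //single character: there is no T[1] to set
            T[1] = 0;
            int pos = 2;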
            while (pos < s.Length) {
                if (s[pos - 1] == s[cnd]) {
                    T[pos] = cnd + 1;
                    pos += 1;
                    cnd += 1;
                } else if (cnd > 0) {
                    cnd = T[cnd];
                } else {
                    T[pos] = 0;
                    pos += 1;
                }
            }
    
            return T;
        }
    

    OverlappedStringLength("abcdef", "defghl") returns 3, the length of the shared "def".
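
    To tie this back to the question of concatenating three columns, here is a minimal sketch of how the overlap length might be used to merge strings. The names OverlapConcat, Merge and MergeColumns are hypothetical, and treating a NULL column as an empty string is an assumption, not something the question specifies. The overlap routine is repeated (with a guard for short strings) so the block stands on its own.

```csharp
using System;

static class OverlapConcat
{
    // Length of the longest suffix of s1 that is also a prefix of s2,
    // found with a KMP scan of s1 using s2 as the pattern.
    public static int OverlappedStringLength(string s1, string s2)
    {
        // Only the last |s2| characters of s1 can take part in the overlap.
        if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);

        int[] T = ComputeBackTrackTable(s2);
        int m = 0; // start of the current candidate match in s1
        int i = 0; // number of characters of s2 matched so far
        while (m + i < s1.Length)
        {
            if (s2[i] == s1[m + i])
            {
                i += 1;
            }
            else
            {
                m += i - T[i];
                if (i > 0) i = T[i];
            }
        }
        return i;
    }

    static int[] ComputeBackTrackTable(string s)
    {
        var T = new int[s.Length];
        if (s.Length == 0) return T;
        T[0] = -1;
        // T[1], when it exists, keeps its default value 0.
        int cnd = 0;
        int pos = 2;
        while (pos < s.Length)
        {
            if (s[pos - 1] == s[cnd]) { T[pos] = cnd + 1; pos += 1; cnd += 1; }
            else if (cnd > 0) { cnd = T[cnd]; }
            else { T[pos] = 0; pos += 1; }
        }
        return T;
    }

    // Hypothetical helper: append s2 to s1 without repeating the overlap.
    public static string Merge(string s1, string s2)
    {
        return s1 + s2.Substring(OverlappedStringLength(s1, s2));
    }

    // Hypothetical helper: merge three columns left to right.
    // Assumption: a null (database NULL) column is treated as empty.
    public static string MergeColumns(string a, string b, string c)
    {
        return Merge(Merge(a ?? "", b ?? ""), c ?? "");
    }
}
```

    With this sketch, OverlapConcat.Merge("abcdef", "defghl") gives "abcdefghl", and OverlapConcat.MergeColumns("abcdef", "defghl", "ghlmn") gives "abcdefghlmn". Merging left to right means the third column is matched against the already merged result of the first two, which is one reasonable reading of the requirement but not the only one.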
