Efficient Algorithm for String Concatenation with Overlap

前端 未结 12 2248
轻奢々
轻奢々 2020-12-02 21:32

We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,

<
12条回答
  •  予麋鹿
    予麋鹿 (楼主)
    2020-12-02 21:50

    Here is a Java implementation which finds the maximum overlap between two strings with length N and M in something like O(min(N,M)) operations ~ O(N).

    I had the same idea as @sepp2k:s now deleted answer, and worked a bit further on it. Seems to work fine. The idea is to iterate through the first string and start tracking once you find something which matches the start of the second string. Figured out that you might need to do multiple simultaneous trackings if false and true matches are overlapping. At the end you choose the longest track.

    I havent worked out the absolutely worst case yet, with maximal overlap between matches, but I don't expect it to spiral out of control since I think that you cannot overlap arbitrary many matches. Normally you only track one or two matches at a time: the candidates are removed as soon as there is a mismatch.

    static class Candidate {
        int matchLen = 0;
    }
    
    private String overlapOnce(@NotNull final String a, @NotNull final String b) {
        final int maxOverlap = Math.min(a.length(), b.length());
        final Collection candidates = new LinkedList<>();
        for (int i = a.length() - maxOverlap; i < a.length(); ++i) {
            if (a.charAt(i) == b.charAt(0)) {
                candidates.add(new Candidate());
            }
            for (final Iterator it = candidates.iterator(); it.hasNext(); ) {
                final Candidate candidate = it.next();
                if (a.charAt(i) == b.charAt(candidate.matchLen)) {
                    //advance
                    ++candidate.matchLen;
                } else {
                    //not matching anymore, remove
                    it.remove();
                }
            }
    
        }
        final int matchLen = candidates.isEmpty() ? 0 :
                candidates.stream().map(c -> c.matchLen).max(Comparator.comparingInt(l -> l)).get();
        return a + b.substring(matchLen);
    }
    
    private String overlapOnce(@NotNull final String... strings) {
        return Arrays.stream(strings).reduce("", this::overlapOnce);
    }
    

    And some tests:

    @Test
    public void testOverlapOnce() throws Exception {
        assertEquals("", overlapOnce("", ""));
        assertEquals("ab", overlapOnce("a", "b"));
        assertEquals("abc", overlapOnce("ab", "bc"));
        assertEquals("abcdefghqabcdefghi", overlapOnce("abcdefgh", "efghqabcdefghi"));
        assertEquals("aaaaaabaaaaaa", overlapOnce("aaaaaab", "baaaaaa"));
        assertEquals("ccc", overlapOnce("ccc", "ccc"));
        assertEquals("abcabc", overlapOnce("abcabc", "abcabc"));
    
        /**
         *  "a" + "b" + "c" => "abc"
         "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
         "abcdede" + "dedefgh" + "" => "abcdedefgh"
         "abcde" + "d" + "ghlmn" => "abcdedghlmn"
         "abcdef" + "" + "defghl" => "abcdefghl"
         */
        assertEquals("abc", overlapOnce("a", "b", "c"));
        assertEquals("abcdefghlmn", overlapOnce("abcde", "defgh", "ghlmn"));
        assertEquals("abcdedefgh", overlapOnce("abcdede", "dedefgh"));
        assertEquals("abcdedghlmn", overlapOnce("abcde", "d", "ghlmn"));
        assertEquals("abcdefghl", overlapOnce("abcdef", "", "defghl"));
    
    
        // Consider str1=abXabXabXac and str2=XabXac. Your approach will output abXabXabXacXabXac because by
        // resetting j=0, it goes to far back.
        assertEquals("abXabXabXac", overlapOnce("abXabXabXac", "XabXac"));
    
        // Try to trick algo with an earlier false match overlapping with the real match
        //  - match first "aba" and miss that the last "a" is the start of the
        // real match
        assertEquals("ababa--", overlapOnce("ababa", "aba--"));
    }
    

提交回复
热议问题