PHP Detect Duplicate Text

夕颜 2021-02-05 07:52

I have a site where users can put in a description about themselves.

Most users write something appropriate but some just copy/paste the same text a number of times (to

9 answers
  •  忘掉有多难
    2021-02-05 08:26

    I am not sure it is a good idea to fight this problem. If a person wants to put junk in the about-me field, they will always find a way to do it. But I will ignore this fact and treat the problem as an algorithmic challenge:

    Given a string S that consists of one substring repeated many times (without overlaps), find the substring it consists of.

    The definition is loose, and I assume that the string has already been converted to lowercase.

    First an easier way:


    Use a modification of the longest common substring problem, which has an easy dynamic programming (DP) solution. But instead of finding a common substring of two different strings, compare the string with itself: LCS(s, s).

    It sounds stupid at first (surely LCS(s, s) == s), but we do not actually care about the answer; we care about the DP matrix it produces.

    Let's look at an example: s = "abcabcabc", for which the matrix is:

    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
    [0, 0, 2, 0, 0, 2, 0, 0, 2, 0]
    [0, 0, 0, 3, 0, 0, 3, 0, 0, 3]
    [0, 1, 0, 0, 4, 0, 0, 4, 0, 0]
    [0, 0, 2, 0, 0, 5, 0, 0, 5, 0]
    [0, 0, 0, 3, 0, 0, 6, 0, 0, 6]
    [0, 1, 0, 0, 4, 0, 0, 7, 0, 0]
    [0, 0, 2, 0, 0, 5, 0, 0, 8, 0]
    [0, 0, 0, 3, 0, 0, 6, 0, 0, 9]
    

    Note the nice diagonals there. As you can see in the last row, the first diagonal ends with 3, the second with 6 and the third with 9 (the 9 is the trivial LCS answer, which we do not care about).

    This is not a coincidence. If you look in more detail at how the DP matrix is constructed, you can see that these diagonals correspond to duplicated strings.
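
    One way to see why: an entry k at row x, column y means that the k characters ending at position x equal the k characters ending at position y. The sketch below checks that property for every nonzero entry; the function names common_suffix_matrix and diagonal_matches are made up for this illustration.

```python
def common_suffix_matrix(s):
    """m[x][y] = length of the longest common suffix of s[:x] and s[:y]."""
    n = len(s)
    m = [[0] * (n + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):
        for y in range(1, n + 1):
            if s[x - 1] == s[y - 1]:
                # extend the diagonal coming from the upper-left neighbour
                m[x][y] = m[x - 1][y - 1] + 1
    return m


def diagonal_matches(s):
    """True if every nonzero entry k at (x, y) marks two equal k-character slices."""
    m = common_suffix_matrix(s)
    return all(
        s[x - k:x] == s[y - k:y]
        for x, row in enumerate(m)
        for y, k in enumerate(row)
        if k
    )


s = "abcabcabc"
m = common_suffix_matrix(s)
# m[6][3] is 3: the slice s[3:6] repeats the slice s[0:3]
print(m[6][3], s[3:6], s[0:3])
print(diagonal_matches(s))
```

    So a long diagonal away from the main one is a long stretch of text that occurs twice.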

    Here is an example for s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas" and the very last row in the matrix is: [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 17, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 34, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 51, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 68].

    As you can see, the big numbers (17, 34, 51, 68) correspond to the ends of the diagonals (there is also some noise, because I deliberately added short repeated letters like aaa).

    This suggests that we can just take the gcd of the two biggest numbers, gcd(68, 51) = 17, which is the length of our repeated substring.

    Here, because we know that the whole string consists of repeated substrings, we know that the repeat starts at position 0 (if we did not know this, we would need to find the offset).

    And here we go: the string is "aaabasdfwasfsdtas".
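
    The gcd step above can be sketched roughly as follows. This is only an illustration under the same assumption (the whole string is an exact repetition starting at position 0); the names last_dp_row and repeated_unit are hypothetical, and the final check simply rejects strings for which the guess does not reproduce the input.

```python
from math import gcd


def last_dp_row(s):
    """Last row of the DP matrix of s against itself, keeping only two rows in memory."""
    prev = [0] * (len(s) + 1)
    for x in range(1, len(s) + 1):
        cur = [0] * (len(s) + 1)
        for y in range(1, len(s) + 1):
            if s[x - 1] == s[y - 1]:
                cur[y] = prev[y - 1] + 1
        prev = cur
    return prev


def repeated_unit(s):
    """Guess the repeated unit from the two largest values of the last row.

    Returns None when the guess does not rebuild s exactly.
    """
    if not s:
        return None
    top = sorted(last_dp_row(s), reverse=True)[:2]
    period = gcd(top[0], top[1])  # top[0] is always len(s)
    unit = s[:period]
    return unit if unit * (len(s) // period) == s else None


print(repeated_unit("abcabcabc"))
print(repeated_unit("aaabasdfwasfsdtas" * 4))
```

    Note that this still costs time quadratic in the length of the text, so it is better suited to short profile fields than to long documents.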

    P.S. this method allows you to find repeats even if they are slightly modified.

    For people who would like to play around, here is a Python script (written in a hurry, so feel free to improve it):

    import numpy as np
    import matplotlib.pyplot as plt

    def longest_common_substring(s1, s2):
        # m[x][y] = length of the common substring ending at s1[x-1] and s2[y-1]
        m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
        for x in range(1, 1 + len(s1)):
            for y in range(1, 1 + len(s2)):
                if s1[x - 1] == s2[y - 1]:
                    m[x][y] = m[x - 1][y - 1] + 1
        return m  # we want the whole matrix, not just the longest match

    s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
    m = longest_common_substring(s, s)
    print(m[-1])  # last row: the big values mark the ends of the diagonals
    # draw the matrix as a greyscale image; the repeats show up as bright diagonals
    plt.imshow(np.array(m), cmap="Greys_r", interpolation="none")
    plt.show()
    

    I described the easy way and forgot to write about the hard way. It is getting late, so I will just explain the idea. The implementation is harder, and I am not sure whether it will give you better results. But here it is:

    Use an algorithm for the longest repeated substring (you will need to implement a trie or a suffix tree, which is not easy in PHP).

    After this:

    s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
    s1 = largest_substring_algo1(s)
    

    I took the implementation of largest_substring_algo1 from here. It is not the best one (it is just for showing the idea), as it does not use the above-mentioned data structures. The results for s and s1 are:

    aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas
    aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaa
    

    As you can see, the difference between them is the substring that was duplicated.
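
    For experimenting with this second idea, here is a naive stand-in. It is not the largest_substring_algo1 referred to above and it uses no trie or suffix tree; it just sorts all suffixes and compares neighbours, which is fine for short strings but wasteful for long ones. The function name is an assumption made for this sketch.

```python
def longest_repeated_substring(s):
    """Naive longest repeated substring (overlaps allowed) via sorted suffixes."""
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    # any repeated substring is a common prefix of two suffixes,
    # and the longest one shows up between neighbours in sorted order
    for a, b in zip(suffixes, suffixes[1:]):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        if n > len(best):
            best = a[:n]
    return best


s = "abcabcabc"
s1 = longest_repeated_substring(s)
print(s1)            # abcabc
print(s[len(s1):])   # abc, the part of s left over after s1
```

    On this small input the leftover part is exactly the repeated unit, which is the observation the approach relies on.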
