Longest Non-Overlapping Substring

自闭症网瘾萝莉.ら 提交于 2019-12-04 08:45:43

Since you're applying this to music, you're probably not looking for 100% matches; there will be expected discrepencies from one instance of the theme to another. You might try looking up gene sequence analysis algorithms - there's lots of this variety of stuff going on there. Try BLAST or other alignment algorithms.

You could also try the WinEPI family of algorithms, as well as various prefix tree implementations of this specific result set (these results allow gaps in the substring; for instance, ABCID and AFBCD both contain ABCD). I actually have a paper on an algorithm that might be worth looking at if you would be interested; I would need to attain distribution authorization, so let me know.

Note that this is actually a VERY complicated subject (to do efficiently) for large datasets.

Source: Research for a year or two on a comparable (theme-detection) topic.

Suffix array is a good data structure for solving that problem.

There is a solution to this problem in Programming Pearls by Jon Bentley.

A simple algorithm is to do this:

  • Create a data structure representing slices of a string; implement comparison / sorting as appropriate for the language
  • Create a list of every slice of the string starting with [first-character, last-character], [second-character, last-character], up to [last-character, last-character]
  • Sort this list - O(n log n)
  • This brings all string slices with common prefixes together. You can then iterate through the list, comparing each pair to see how many characters they share at the start, and if it's larger than your maximum then take a note of it and set it as the new maximum

(As the other reply just posted indicates, I'm describing a suffix array here.)

Okay, here's one crazy idea.

Create a function that determines if there is a recurring substring of length k in O(n) (where n is the length of the text). This can be done by using a rolling hash (see wikipedia for Rabin-Karp String Matching Algorithm) to generate all n hashes in linear time and using a hashtable/BST (or a map or dict or whatever your language calls it) to store the corresponding substring positions. Before inserting the current hash to the data structure, we check if we have seen it before. If it has been seen before, we simply look up the indices where this hash was generated and see if the corresponding substring is a non-overlapping match.

Now that we can find a k length substring in O(n) time, we simply run a binary search to find the largest k for which we can find a non-overlapping substring match using our function.

Code in Python follows

A=23
MOD=10007 # use a random value in the real world

def hsh(text):
    return sum([ord(c)*A**(len(text)-i-1) for i,c in enumerate(text)])%MOD

def k_substring(text, k):
    substrings={}
    cur=hsh(text[:k])
    pwr=(A**(k-1))%MOD
    substrings[cur]=[0]
    for i in xrange(k,len(text)):
        to_remove=(ord(text[i-k])*pwr)%MOD
        to_add=ord(text[i])
        cur-=to_remove
        if cur<0:
            cur+=MOD
        cur=(cur*A)%MOD
        cur=(cur+to_add)%MOD
        if cur in substrings:
            lst=substrings[cur]
            for index in lst:
                if index+k<=i-k+1 and text[index:index+k]==text[i-k+1:i+1]:
                    return index
            lst.append(i-k+1)
        else:
            substrings[cur]=[i-k+1]
    return -1

def longest_substring(text):
    hi,lo=len(text),0
    while hi>lo:
        mid=(hi+lo+1)>>1
        if k_substring(text,mid)==-1:
            hi=mid-1
        else:
            lo=mid
    index=k_substring(text,lo)
    return text[index:index+lo]

(Sorry if this is unclear. It is 6:30am here and I really need to go back to bed :) )

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!