Find K-most longest common suffix in a set of strings

╄→尐↘猪︶ㄣ 提交于 2019-12-13 04:02:02

问题


I want to find most longest common suffix in a set of strings to detect some potential important morpheme in my natural language process project.

Given frequency K>=2,find the K-most common longest suffix in a list of strings S1,S2,S3...SN

To simplify the problem, here are some examples:

Input1:

 K=2
 S=["fireman","woman","businessman","policeman","businesswoman"]

Output1:

 ["man","eman","woman"]

Explanation1:

"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times


It would be appreciated if the output also tracked the frequency of each common suffix

{"man":4,"eman":2,"woman":2}

It is not guaranteed that each word will have at least one-length common suffix, see the example below.

Input2:

 K=2
 S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]

Output2:

 ["man","eman","woman","ple","apple"]

Explanation2:

"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times

"ple" occur 3 times, "apple" occur 2 times


Any efficient algorithm to solve this problem ?

I have referred to Suffix Tree and Generalised Suffix Tree algorithm but it seems not to fit this problem well.

(BTW, I am researching on a Chinese NLP project which means there are far more characters than English has)


回答1:


If you want to be really efficient this paper could help. The algorithm gives you the most frequent (longest) substrings from a set of strings in O(n log n). The basic idea is the following:

  1. Concatenate all strings with whitespace between them (these are used as seperators later) in order to from a single long string

  2. Build a suffix array SA over the long string

  3. For each pair of adjacent entries in the suffix array, check how many of the starting characters are equal of the two suffixes that the entries refer to (you should stop when you reach a whitespace). Put the count of equal characters in an additional array (called LCP). Each entry LCP[i] refers to the count of equal characters of the string at position SA[i-1] and SA[i] in the long string.

  4. Go over the LCP an find successive entries with a count larger than K.

Example from the paper for3 strings

s1= bbabab

s2 = abacac

s3 = bbaaa



来源:https://stackoverflow.com/questions/47927878/find-k-most-longest-common-suffix-in-a-set-of-strings

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!