Algorithm to find common substring across N strings

前端未结


I'm familiar with LCS algorithms for 2 strings. Looking for suggestions for finding common substrings in 2..N strings. There may be multiple common substrings in each pair.

2 Answers

情歌与酒

2020-12-06 07:49

This sort of thing is done all the time in DNA sequence analysis. You can find a variety of algorithms for it. One reasonable collection is listed here.

There's also the brute-force approach of tabulating every substring (if you're interested only in short ones): build a trie whose fan-out at each level is the alphabet size (26 for letters, 256 for 8-bit characters), and store a count at every node. If you prune off little-used nodes (to keep the memory requirements reasonable), you end up with an algorithm that finds all substrings of length up to M in something like N*M^2*log(M) time for input of length N. If the input is instead K separate strings, you can build one such tree over all of them, recording at each node how many of the strings reach it, and then read off the answer(s) in a single pass through the tree.
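The tree-of-counts idea can be sketched roughly like this in Python. It is only an illustration under a few assumptions of mine: the names `Node` and `common_substrings` are made up, the count at each node is "how many distinct input strings contain this substring", and there is no pruning of rare nodes, so memory grows with the number of distinct substrings up to `max_len`.

```python
from typing import Dict, List


class Node:
    """One trie node (hypothetical helper): children by character plus a per-string count."""
    __slots__ = ("children", "count", "last_seen")

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.count = 0        # number of distinct input strings containing this substring
        self.last_seen = -1   # index of the last string that incremented `count`


def common_substrings(strings: List[str], max_len: int) -> List[str]:
    """Return every substring of length 1..max_len that occurs in all of `strings`."""
    root = Node()
    for idx, s in enumerate(strings):
        # Insert every substring of s that starts at `start` and has length <= max_len.
        for start in range(len(s)):
            node = root
            for ch in s[start:start + max_len]:
                child = node.children.get(ch)
                if child is None:
                    child = Node()
                    node.children[ch] = child
                # Count each string at most once per node.
                if child.last_seen != idx:
                    child.last_seen = idx
                    child.count += 1
                node = child

    # Single pass through the trie. A node that is missing from some string
    # cannot have a descendant present in all strings, so that branch is skipped.
    found: List[str] = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node.children.items():
            if child.count == len(strings):
                found.append(prefix + ch)
                stack.append((child, prefix + ch))
    return sorted(found)


if __name__ == "__main__":
    print(common_substrings(["abcd", "xbcy", "bcz"], max_len=3))
```

Relaxing the `child.count == len(strings)` test to a threshold would give substrings shared by "most" of the strings instead of all of them.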

予麋鹿

2020-12-06 07:54

Suffix trees are the answer unless you have really large strings where memory becomes a problem. Expect 10 to 30 bytes of memory usage per character in the string for a good implementation. There are a couple of open-source implementations too, which make your job easier.

There are other, more succinct algorithms too, but they are harder to implement (look for "compressed suffix trees").
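A real (generalized) suffix tree is too long to paste here, so the following is not one. It is a much simpler stand-in that may be good enough for moderate input sizes: binary search on the substring length, with a set intersection at each probe. It relies on the fact that if all strings share a substring of length k, they also share one of length k-1. The function names are made up for illustration, and each probe costs roughly total input length times k in time and memory, which is far from the linear behaviour a suffix tree gives.

```python
from typing import List, Set


def substrings_of_length(s: str, k: int) -> Set[str]:
    """All distinct substrings of s with exactly k characters."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def longest_common_substrings(strings: List[str]) -> Set[str]:
    """Return the set of longest substrings shared by every string (empty set if none)."""
    if not strings:
        return set()

    def common(k: int) -> Set[str]:
        # Intersect the length-k substring sets, stopping early once empty.
        shared = substrings_of_length(strings[0], k)
        for s in strings[1:]:
            shared &= substrings_of_length(s, k)
            if not shared:
                break
        return shared

    lo, hi = 0, min(len(s) for s in strings)
    best: Set[str] = set()
    # Invariant: a common substring of length `lo` exists (trivially for 0),
    # and none exists for any length greater than `hi`.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        shared = common(mid)
        if shared:
            lo, best = mid, shared
        else:
            hi = mid - 1
    return best


if __name__ == "__main__":
    print(longest_common_substrings(["xabcy", "zabcw", "abcq"]))
```

If the strings are long enough that this becomes slow or memory-hungry, that is the point where a suffix tree or suffix array library is worth the effort.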
