Algorithm to find common substring across N strings

小鲜肉 2020-12-06 07:24

I'm familiar with LCS algorithms for 2 strings. I'm looking for suggestions for finding common substrings in 2..N strings. There may be multiple common substrings in each pair.

2 answers
  • 2020-12-06 07:49

    This sort of thing is done all the time in DNA sequence analysis. You can find a variety of algorithms for it. One reasonable collection is listed here.

    There's also the brute-force approach of tabulating every substring, which is practical if you're interested only in short ones: build a tree with one branch per symbol at each level (branching factor 26 for letters, 256 for ASCII), and store an occurrence count at every node. If you prune off little-used nodes (to keep the memory requirements reasonable), you end up with an algorithm that finds all substrings of length up to M in something like N*M^2*log(M) time for input of length N. If the input is instead split into K separate strings, you can build the same tree structure and just read off the answer(s) in a single pass through the tree.
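
A minimal Python sketch of that brute-force tree, under some assumptions: the function name `common_substrings` and the node layout are made up for illustration, each node records which input strings reached it rather than a full histogram, and the memory-saving pruning of rare nodes is left out.

```python
def common_substrings(strings, max_len):
    """Return every substring of length 1..max_len that occurs in all strings."""
    k = len(strings)
    root = {"children": {}, "seen": set()}

    # Insert every substring of length <= max_len, one start position at a time.
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:start + max_len]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "seen": set()}
                )
                node["seen"].add(idx)  # string `idx` contains this prefix

    # Single pass over the tree: keep nodes reached by all k strings.
    # A node missed by some string cannot have a fully shared descendant,
    # so the walk does not descend below it.
    found = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            if len(child["seen"]) == k:
                found.append(prefix + ch)
                stack.append((child, prefix + ch))
    return sorted(found)


print(common_substrings(["abcx", "yabc", "zabcw"], 3))
# ['a', 'ab', 'abc', 'b', 'bc', 'c']
```

Memory grows with the number of distinct substrings up to `max_len`, which is why this only suits short substrings unless rare nodes are pruned.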

  • 2020-12-06 07:54

    Suffix trees are the answer unless you have really large strings where memory becomes a problem. Expect 10 to 30 bytes of memory usage per character in the string for a good implementation. There are a couple of open-source implementations too, which make your job easier.

    There are other, more succinct data structures too, but they are harder to implement (look for "compressed suffix trees").
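
To give a feel for the suffix-based idea without a full suffix tree, here is a rough Python sketch that swaps in a simpler stand-in: a naively sorted list of all suffixes of all inputs (a generalized suffix array). It is not a suffix tree and uses quadratic time and memory, so treat it as an illustration for small inputs only. The name `longest_common_substring` is hypothetical, and it returns just one longest substring shared by every input.

```python
def longest_common_substring(strings):
    """Return one longest substring that occurs in every input string."""
    k = len(strings)
    if k == 0:
        return ""

    # Every suffix of every string, tagged with its source string index.
    suffixes = sorted(
        (s[i:], idx) for idx, s in enumerate(strings) for i in range(len(s))
    )

    def lcp(a, b):
        """Length of the longest common prefix of a and b."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    # lcps[i] is the common prefix length of sorted neighbours i and i + 1.
    lcps = [
        lcp(suffixes[i][0], suffixes[i + 1][0])
        for i in range(len(suffixes) - 1)
    ]

    best = ""
    counts = {}  # source string index -> number of its suffixes in the window
    left = 0
    for right in range(len(suffixes)):
        idx = suffixes[right][1]
        counts[idx] = counts.get(idx, 0) + 1
        # While the window holds a suffix from every string, the prefix
        # shared by the whole window is a substring common to all inputs.
        while len(counts) == k:
            if left == right:  # only possible when k == 1
                common = len(suffixes[left][0])
            else:
                common = min(lcps[left:right])
            if common > len(best):
                best = suffixes[left][0][:common]
            left_idx = suffixes[left][1]
            counts[left_idx] -= 1
            if counts[left_idx] == 0:
                del counts[left_idx]
            left += 1
    return best


print(longest_common_substring(["xabcdy", "zabcdw", "qqabcd"]))  # abcd
```

A real suffix tree or a properly built suffix array answers the same query in roughly linear space, which is where the per-character memory figures above come from.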
