I\'m looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ p
An optimal solution for this problem is to use a suffix tree (or a suffix array). It's essentially a trie of all suffixes of a string. For a text of length O(N)
, this can be built in O(N)
.
Then all k
occurrences of a string of length m
can be answered optimally in O(m + k)
.
Suffix trees can also be used to efficiently find e.g. the longest palindrome, the longest common substring, the longest repeated substring, etc.
This is the typical data structure to use when analyzing DNA strings which can be millions/billions of bases long.