Over the last few weeks I have been trying to figure out how to efficiently find a string pattern within another string.
I found out that for a long time, the most efficient way
I think @Erti-Chris Eelmaa's algorithm is wrong.
L ... 'M ... M ... M' ... R
|-----|-----|
The left and right subranges must both contain M, so we cannot use the normal (disjoint) segment tree partition for the LCP-LR array. The code should look like this:
def lcp_from_i_j(i, j):  # means [i, j], not [i, j)
    if j - i <= 1: return lcp_2_elem(i, j)
    mid = (i + j) // 2
    return lcp_merge(lcp_from_i_j(i, mid), lcp_from_i_j(mid, j))
The left and right subranges overlap at the midpoint. A segment tree supports range-min queries, but the range min over [a, b] is not equal to the LCP over [a, b]. LCP ranges are contiguous, so a simple range-min would not work!
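The overlapping recursion above can be sketched in C++ as follows. This is only an illustration under stated assumptions: `adj[k]` is assumed to hold the LCP of the suffixes at suffix-array positions k and k+1, `lcp_merge` is assumed to be `min`, and the function signatures are hypothetical, not taken from the answer being discussed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumed layout: adj[k] is the LCP of the suffixes at suffix-array
// positions k and k+1. The name mirrors lcp_2_elem from the pseudocode.
int lcp_2_elem(const std::vector<int>& adj, int i) {
    return adj[i];
}

// LCP of the suffixes at suffix-array positions i and j (i < j).
// Both recursive calls include mid, so the two subranges overlap there.
int lcp_from_i_j(const std::vector<int>& adj, int i, int j) {
    assert(0 <= i && i < j && j <= static_cast<int>(adj.size()));
    if (j - i == 1) return lcp_2_elem(adj, i);   // two adjacent positions
    int mid = i + (j - i) / 2;                   // strictly between i and j
    // lcp_merge is assumed to be min here.
    return std::min(lcp_from_i_j(adj, i, mid), lcp_from_i_j(adj, mid, j));
}
```

For "banana" the suffix array is 5 3 1 0 4 2 and the adjacent LCP values are 1 3 0 0 2, so the call for positions 0 and 2 compares "a" with "anana".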
The term that can help you is enhanced suffix array, which describes a suffix array combined with various other arrays (lcp, child) so that it can replace a suffix tree.
Some example implementations:
https://code.google.com/p/esaxx/ ESAXX
http://bibiserv.techfak.uni-bielefeld.de/mkesa/ MKESA
The esaxx one seems to do what you want; it also includes an example, enumSubstring.cpp, showing how to use it.
If you take a look at the referenced paper, it mentions a useful property (4.2). Since SO does not support math, there is no point in copying it here.
I've done a quick implementation; it uses a segment tree:
// note that arrSize is O(n)
// int arrSize = 2 * pow(2, ceil(log2(N)) + 1) + 1; // "^" meant power, not XOR; indices start from 1
// LCP = new int[N + 1]; // 1-indexed: LCP[1..N]
// fill the LCP...
// LCP_LR = new int[arrSize];
// std::fill(LCP_LR, LCP_LR + arrSize, INT_MAX); // memset would fill bytes, not ints
//
// init: buildLCP_LR(1, 1, N);
// LCP_LR[1] covers [1..N]
// LCP_LR[2] covers [1..N/2]
// LCP_LR[3] covers [N/2+1 .. N]
// rangeI = LCP_LR[i]
// rangeILeft = LCP_LR[2 * i]
// rangeIRight = LCP_LR[2 * i + 1]
// ..etc
void buildLCP_LR(int index, int low, int high)
{
    if(low == high)
    {
        // leaf: a single LCP entry
        LCP_LR[index] = LCP[low];
        return;
    }
    int mid = (low + high) / 2;
    buildLCP_LR(2*index, low, mid);
    buildLCP_LR(2*index+1, mid + 1, high);
    // inner node: minimum over both children
    LCP_LR[index] = min(LCP_LR[2*index], LCP_LR[2*index + 1]);
}
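The build step above only fills the tree; a range-min query over it might look like the sketch below. The struct name `LcpRangeMin` and the `query` method are hypothetical additions for illustration, the tree is sized at 4 * n for simplicity, and the LCP array is assumed to be 1-indexed as in the build code.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Range-min segment tree over a 1-indexed LCP array (index 0 is unused).
struct LcpRangeMin {
    int n;                  // number of LCP entries
    std::vector<int> lcp;   // lcp[1..n]
    std::vector<int> tree;  // plays the role of LCP_LR

    explicit LcpRangeMin(const std::vector<int>& one_indexed_lcp)
        : n(static_cast<int>(one_indexed_lcp.size()) - 1),
          lcp(one_indexed_lcp),
          tree(4 * std::max(n, 1), INT_MAX) {
        assert(n >= 1);
        build(1, 1, n);
    }

    // Same recursion as buildLCP_LR: node `index` covers [low, high].
    void build(int index, int low, int high) {
        if (low == high) { tree[index] = lcp[low]; return; }
        int mid = (low + high) / 2;
        build(2 * index, low, mid);
        build(2 * index + 1, mid + 1, high);
        tree[index] = std::min(tree[2 * index], tree[2 * index + 1]);
    }

    // Minimum of lcp[from..to], both ends inclusive.
    int query(int from, int to) const {
        assert(1 <= from && from <= to && to <= n);
        return query(1, 1, n, from, to);
    }

    int query(int index, int low, int high, int from, int to) const {
        if (to < low || high < from) return INT_MAX;         // no overlap
        if (from <= low && high <= to) return tree[index];   // fully covered
        int mid = (low + high) / 2;
        return std::min(query(2 * index, low, mid, from, to),
                        query(2 * index + 1, mid + 1, high, from, to));
    }
};
```

Each query visits O(log n) nodes, so the LCP information for a binary-search interval can be fetched without precomputing every interval.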
Here is a fairly simple implementation in C++, though the build() procedure builds the suffix array in O(N lg^2 N) time. The lcp_compute() procedure has linear complexity. I have used this code in many programming contests, and it has never let me down :)
#include <stdio.h>
#include <string.h>
#include <algorithm>
using namespace std;
const int MAX = 200005;
char str[MAX];
int N, h, sa[MAX], pos[MAX], tmp[MAX], lcp[MAX];
bool compare(int i, int j) {
    if(pos[i] != pos[j]) return pos[i] < pos[j]; // ranks differ on the first h chars
    i += h, j += h; // tie on the first h chars: compare the next h chars
    return (i < N && j < N) ? pos[i] < pos[j] : i > j; // a suffix that runs out is the smaller one
}
void build() {
    N = strlen(str);
    for(int i=0; i<N; ++i) sa[i] = i, pos[i] = str[i]; // initialize variables
    for(h=1;;h<<=1) {
        sort(sa, sa+N, compare); // sort suffixes by their first 2*h chars
        for(int i=0; i<N-1; ++i) tmp[i+1] = tmp[i] + compare(sa[i], sa[i+1]); // bucket suffixes
        for(int i=0; i<N; ++i) pos[sa[i]] = tmp[i]; // update pos (reverse mapping of suffix array)
        if(tmp[N-1] == N-1) break; // done once every suffix has its own bucket
    }
}
void lcp_compute() {
    // lcp[r] = LCP of the suffixes at ranks r and r+1; relies on the '\0' terminator of str
    for(int i=0, k=0; i<N; ++i)
        if(pos[i] != N-1) {
            for(int j=sa[pos[i]+1]; str[i+k] == str[j+k];) k++;
            lcp[pos[i]] = k;
            if(k) k--;
        }
}
int main() {
    scanf("%s", str);
    build();
    for(int i=0; i<N; ++i) printf("%d\n", sa[i]);
    return 0;
}
Note: If you want the complexity of the build() procedure to become O(N lg N), you can replace the STL sort with radix sort, but this is going to complicate the code.
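For comparison, here is a sketch of an O(N lg N) construction that replaces the comparison sort with a counting sort in each doubling round. It is not a drop-in change to the build() procedure above: it uses the common cyclic-shift formulation with an appended '\0' sentinel, it assumes the text itself contains no '\0' characters, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Sorts all cyclic shifts of s by prefix doubling with counting sort.
std::vector<int> sort_cyclic_shifts(const std::string& s) {
    const int n = static_cast<int>(s.size());
    const int alphabet = 256;
    std::vector<int> p(n), c(n), cnt(std::max(alphabet, n), 0);
    // Round 0: counting sort by single character.
    for (int i = 0; i < n; ++i) cnt[static_cast<unsigned char>(s[i])]++;
    for (int i = 1; i < alphabet; ++i) cnt[i] += cnt[i - 1];
    for (int i = 0; i < n; ++i) p[--cnt[static_cast<unsigned char>(s[i])]] = i;
    c[p[0]] = 0;
    int classes = 1;
    for (int i = 1; i < n; ++i) {
        if (s[p[i]] != s[p[i - 1]]) ++classes;
        c[p[i]] = classes - 1;
    }
    std::vector<int> pn(n), cn(n);
    for (int h = 0; (1 << h) < n; ++h) {
        // Shifting back by 2^h leaves pn ordered by the second half.
        for (int i = 0; i < n; ++i) {
            pn[i] = p[i] - (1 << h);
            if (pn[i] < 0) pn[i] += n;
        }
        // Stable counting sort by the class of the first half.
        std::fill(cnt.begin(), cnt.begin() + classes, 0);
        for (int i = 0; i < n; ++i) cnt[c[pn[i]]]++;
        for (int i = 1; i < classes; ++i) cnt[i] += cnt[i - 1];
        for (int i = n - 1; i >= 0; --i) p[--cnt[c[pn[i]]]] = pn[i];
        // Recompute equivalence classes for substrings of length 2^(h+1).
        cn[p[0]] = 0;
        classes = 1;
        for (int i = 1; i < n; ++i) {
            std::pair<int, int> cur(c[p[i]], c[(p[i] + (1 << h)) % n]);
            std::pair<int, int> prev(c[p[i - 1]], c[(p[i - 1] + (1 << h)) % n]);
            if (cur != prev) ++classes;
            cn[p[i]] = classes - 1;
        }
        c.swap(cn);
    }
    return p;
}

// Suffix array of s: sort the shifts of s + '\0', then drop the sentinel.
std::vector<int> build_suffix_array(std::string s) {
    s.push_back('\0');
    std::vector<int> sa = sort_cyclic_shifts(s);
    sa.erase(sa.begin());
    return sa;
}
```

Each of the O(lg N) rounds does linear work, which is where the O(N lg N) bound comes from; the price is the extra bookkeeping arrays.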
Edit: Sorry, I misunderstood your question. Although I haven't implemented string matching with a suffix array, I think I can describe a simple, non-standard, but fairly efficient algorithm for string matching. You are given two strings, the text and the pattern. From these you create a new string, let's call it concat, which is the concatenation of the two (first the text, then the pattern). You run the suffix array construction algorithm on concat and build the normal lcp array. Then you search the suffix array for the suffix of length pattern.size(), which is the pattern itself. Let's call its position in the suffix array pos. You then need two pointers, lo and hi, with lo = hi = pos at the start. You decrease lo while lcp(lo, pos) >= pattern.size() and you increase hi while lcp(hi, pos) >= pattern.size(). Then you search the range [lo, hi] for a suffix of length at least 2*pattern.size(). If you find one, you have found a match, because such a suffix starts early enough that its first pattern.size() characters lie entirely inside the text. Otherwise, no match exists.
Edit[2]: I will be back with an implementation as soon as I have one...
Edit[3]:
Here it is:
// It works assuming you have built the concatenated string and
// computed the suffix and the lcp arrays
// text.length() ---> tlen
// pattern.length() ---> plen
// concatenated string: str
bool match(int tlen, int plen) {
    int total = tlen + plen;
    int pos = -1;
    // find the rank of the suffix that is exactly the pattern
    for(int i=0; i<total; ++i)
        if(total-sa[i] == plen)
            { pos = i; break; }
    if(pos == -1) return false;
    int lo, hi;
    lo = hi = pos;
    // widen [lo, hi] to all suffixes sharing at least plen chars with the pattern
    while(lo-1 >= 0 && lcp[lo-1] >= plen) lo--;
    while(hi+1 < N && lcp[hi] >= plen) hi++;
    // a suffix of length >= 2*plen starts inside the text and fits the whole pattern there
    for(int i=lo; i<=hi; ++i)
        if(total-sa[i] >= 2*plen)
            return true;
    return false;
}
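To try the concatenation idea end to end without the global arrays, a self-contained sketch might look like this. The function names are hypothetical, and the suffix array is built with a deliberately naive comparison sort to keep the example short; that is an assumption made for illustration, not the build() procedure from above.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort suffix start indices by direct comparison.
std::vector<int> naive_suffix_array(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Kasai-style LCP: lcp[r] is the LCP of the suffixes at ranks r and r+1.
std::vector<int> adjacent_lcp(const std::string& s, const std::vector<int>& sa) {
    const int n = static_cast<int>(s.size());
    std::vector<int> rank(n), lcp(n, 0);
    for (int r = 0; r < n; ++r) rank[sa[r]] = r;
    int k = 0;
    for (int i = 0; i < n; ++i) {
        if (rank[i] == n - 1) { k = 0; continue; }  // last rank has no successor
        int j = sa[rank[i] + 1];
        while (i + k < n && j + k < n && s[i + k] == s[j + k]) ++k;
        lcp[rank[i]] = k;
        if (k > 0) --k;
    }
    return lcp;
}

// True if pattern occurs in text, using the text + pattern concatenation.
bool contains_by_suffix_array(const std::string& text, const std::string& pattern) {
    if (pattern.empty()) return true;
    const std::string concat = text + pattern;
    const int total = static_cast<int>(concat.size());
    const int plen = static_cast<int>(pattern.size());
    const std::vector<int> sa = naive_suffix_array(concat);
    const std::vector<int> lcp = adjacent_lcp(concat, sa);
    // Rank of the suffix that is exactly the pattern.
    int pos = 0;
    while (sa[pos] != total - plen) ++pos;
    // Widen to every suffix sharing at least plen characters with it.
    int lo = pos, hi = pos;
    while (lo > 0 && lcp[lo - 1] >= plen) --lo;
    while (hi + 1 < total && lcp[hi] >= plen) ++hi;
    // A suffix of length >= 2 * plen begins inside the text with room for the pattern.
    for (int r = lo; r <= hi; ++r)
        if (total - sa[r] >= 2 * plen) return true;
    return false;
}
```

The 2 * plen check matters for cases such as text "xab" and pattern "abab": the concatenation "xababab" contains "abab" starting at index 1, but that occurrence runs into the appended pattern, so it must not count as a match.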
Here is a nice post including some code to help you better understand the LCP array and the comparison implementation.
I understand that you want existing code rather than writing your own. Although it is written in Java, this is an implementation of a suffix array with LCP by Sedgewick and Wayne from their Algorithms booksite. It should save you some time and should not be tremendously hard to port to C/C++.
LCP array construction in pseudocode, for those who might want more information about the algorithm.