I\'m working on a fuzzy search implementation and as part of the implementation, we\'re using Apache\'s StringUtils.getLevenshteinDistance. At the moment, we\'re going for a
The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.
One way is to start the values you initially fill in at 1 instead of 0, then just ignore any 0s that you encounter. You'll have to subtract 1 from your final answer.
Another way is to fill the entries left of first and above last with high values so the minimum check will never pick them. That's the way I chose when I had to implement it the other day:
public static int levenshtein(String s, String t, int threshold) {
int slen = s.length();
int tlen = t.length();
// swap so the smaller string is t; this reduces the memory usage
// of our buffers
if(tlen > slen) {
String stmp = s;
s = t;
t = stmp;
int itmp = slen;
slen = tlen;
tlen = itmp;
}
// p is the previous and d is the current distance array; dtmp is used in swaps
int[] p = new int[tlen + 1];
int[] d = new int[tlen + 1];
int[] dtmp;
// the values necessary for our threshold are written; the ones after
// must be filled with large integers since the tailing member of the threshold
// window in the bottom array will run min across them
int n = 0;
for(; n < Math.min(p.length, threshold + 1); ++n)
p[n] = n;
Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// this is the core of the Levenshtein edit distance algorithm
// instead of actually building the matrix, two arrays are swapped back and forth
// the threshold limits the amount of entries that need to be computed if we're
// looking for a match within a set distance
for(int row = 1; row < s.length()+1; ++row) {
char schar = s.charAt(row-1);
d[0] = row;
// set up our threshold window
int min = Math.max(1, row - threshold);
int max = Math.min(d.length, row + threshold + 1);
// since we're reusing arrays, we need to be sure to wipe the value left of the
// starting index; we don't have to worry about the value above the ending index
// as the arrays were initially filled with large integers and we progress to the right
if(min > 1)
d[min-1] = Integer.MAX_VALUE;
for(int col = min; col < max; ++col) {
if(schar == t.charAt(col-1))
d[col] = p[col-1];
else
// min of: diagonal, left, up
d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
}
// swap our arrays
dtmp = p;
p = d;
d = dtmp;
}
if(p[tlen] == Integer.MAX_VALUE)
return -1;
return p[tlen];
}