Optimize speed of Levenshtein distance of many words

I have a cell array dictionary which contains a lot of words (ca. 15000).

I want to compute the function strdist (to calculate the Levenshtein distance) for all the couples of words. I tried in two ways, but they are both really slow. What can be a more efficient solution?

Here is my code (dict_keys is my cell array of length m):

matrix = sparse(m,m);
for i = 1:m-1;
    matrix(i,:) = cellfun( @(u) strdist(dict_keys{i},u), dict_keys );
end

matrix = sparse(m,m);
for i = 1:m-1;
  for j = i+1:m
     matrix(i,j) = strdist(dict_keys{i},dict_keys{j});
  end   
end

Function 'strdist' is not an inbuilt matlab function, so I guess you took if from the File Exchange. That's also why both your approaches are roughly equal in time, cellfun internally just expands into a loop.

If strdist is symmetric, i.e. strdist(a,b)==strdist(b,a) you can however save half the computations. This seems to be the case, so only calculate all cases of j<i in the second loop (which you are doing).

Otherwise you could implement strdist in C as a mex function and probably see some significant speed improvements. A C implementation of the Levenshtein distance can be found for example at rosettacode.org.

Or dig into the details of how the algorithm computes the distance of two strings and see if you can vectorize it and reduce the runtime from quadratic so less. This however is probably not very easy.

Finally if you have the Parallel Computing Toolbox licensed and a multicore CPU you can easily parallelize your code since the strdist calls are completely independent of each other.

There are several much faster methods like Levenshtain Automata. See

http://en.wikipedia.org/wiki/Levenshtein_automaton
http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
https://www.google.com.ng/search?q=Fast+approximate+search+in+large+dictionaries (many different papers. You can also go by (reverse) references from papers on CiteSeerX.

来源：https://stackoverflow.com/questions/27274508/optimize-speed-of-levenshtein-distance-of-many-words

标签

performance

matlab

levenshtein-distance