Optimize speed of Levenshtein distance of many words

廉价感情. Posted on 2019-12-23 01:28:39

Question


I have a cell array dictionary which contains a lot of words (ca. 15000).

I want to compute the function strdist (which calculates the Levenshtein distance) for all pairs of words. I tried two approaches, but both are really slow. What would be a more efficient solution?

Here is my code (dict_keys is my cell array of length m):

1)

matrix = sparse(m,m);
for i = 1:m-1;
    matrix(i,:) = cellfun( @(u) strdist(dict_keys{i},u), dict_keys );
end

2)

matrix = sparse(m,m);
for i = 1:m-1;
  for j = i+1:m
     matrix(i,j) = strdist(dict_keys{i},dict_keys{j});
  end   
end

Answer 1:


The function 'strdist' is not a built-in MATLAB function, so I guess you took it from the File Exchange. That is also why both of your approaches take roughly the same time: cellfun internally just expands into a loop.

If strdist is symmetric, i.e. strdist(a,b)==strdist(b,a), you can save half the computations. This seems to be the case, so only calculate the cases with j>i, which is what your second loop already does.
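A minimal sketch of the symmetric approach follows. The function name pairwise_upper and the distfun argument are made up for illustration; pass @strdist (or any symmetric distance that returns 0 for identical inputs). It fills a dense matrix rather than a sparse one, because almost every entry of a distance matrix is nonzero and element-wise assignment into a sparse matrix is slow.

```matlab
function D = pairwise_upper(words, distfun)
% Hypothetical helper: compute distfun only for pairs with j > i, then mirror.
% Assumes distfun(a,b) == distfun(b,a) and distfun(a,a) == 0.
m = numel(words);
D = zeros(m, m);            % dense; for m = 15000 this needs about 1.8 GB as double
for i = 1:m-1
    for j = i+1:m
        D(i, j) = distfun(words{i}, words{j});
    end
end
D = D + D.';                % copy the upper triangle into the lower triangle
end
```

Usage would look like D = pairwise_upper(dict_keys, @strdist). If memory is tight, a smaller integer class for D may be enough, since word distances are small.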

Otherwise you could implement strdist in C as a MEX function and probably see some significant speed improvements. A C implementation of the Levenshtein distance can be found, for example, at rosettacode.org.

Or dig into the details of how the algorithm computes the distance between two strings and see if you can vectorize it and reduce the runtime from quadratic to something less. This, however, is probably not very easy.
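As a rough illustration of what vectorizing can look like, here is a sketch of the standard dynamic-programming recurrence with each row updated in one vectorized step. This is not the File Exchange strdist and the name lev_rowwise is made up. It does not get below quadratic work per pair; it only removes the inner interpreted loop. The insertion step normally depends on the previous entry of the same row, which is handled here with a running minimum (cummin, available since R2014b).

```matlab
function d = lev_rowwise(a, b)
% Hypothetical sketch: Levenshtein distance with one vectorized row update
% per character of a. Still O(numel(a)*numel(b)) operations in total.
b = b(:).';                         % force a row vector (also handles '')
n = numel(b);
idx = 0:n;
prev = idx;                         % distances from the empty prefix of a
for i = 1:numel(a)
    sub = prev(1:n) + (a(i) ~= b);  % match or substitute (diagonal neighbour)
    del = prev(2:n+1) + 1;          % delete a(i) (neighbour above)
    t = [i, min(sub, del)];         % best cost ignoring insertions
    % insertions: cur(j) = min over k <= j of t(k) + (j - k)
    prev = cummin(t - idx) + idx;
end
d = prev(end);
end
```

Whether this beats a plain double loop depends on word length; for short dictionary words the gain may be modest.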

Finally, if you have the Parallel Computing Toolbox licensed and a multicore CPU, you can easily parallelize your code, since the strdist calls are completely independent of each other.
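A sketch of how the outer loop could be turned into a parfor loop is shown here. The function name pairwise_parfor and the distfun argument are made up for illustration, and the distance is again assumed to be symmetric. Each iteration builds one row in a temporary vector and writes it back as a single slice, which is the access pattern parfor expects for an output variable.

```matlab
function D = pairwise_parfor(words, distfun)
% Hypothetical helper: rows of the upper triangle are computed independently.
% Without the Parallel Computing Toolbox, parfor runs as an ordinary loop.
m = numel(words);
D = zeros(m, m);
parfor i = 1:m
    row = zeros(1, m);              % temporary, private to this iteration
    for j = i+1:m
        row(j) = distfun(words{i}, words{j});
    end
    D(i, :) = row;                  % sliced assignment into the output
end
D = D + D.';                        % mirror, assuming a symmetric distance
end
```

Usage would look like D = pairwise_parfor(dict_keys, @strdist). Early rows contain more pairs than late ones, so the work per iteration is uneven.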




Answer 2:


There are several much faster methods, such as Levenshtein automata. See

  1. http://en.wikipedia.org/wiki/Levenshtein_automaton
  2. http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
  3. https://www.google.com.ng/search?q=Fast+approximate+search+in+large+dictionaries (many different papers; you can also follow (reverse) references from papers on CiteSeerX)


Source: https://stackoverflow.com/questions/27274508/optimize-speed-of-levenshtein-distance-of-many-words
