Levenshtein distance limit

大城市里の小女人 submitted on 2020-01-02 08:51:51

Question


Suppose I have a maximum distance that I do not want to exceed, for example 2. Can I break out of the algorithm before it runs to completion once it is clear that the distance will exceed that limit?

Or are there similar algorithms that allow this?

I need this to reduce the running time of my program.


Answer 1:


If you use top-down dynamic programming (recursion plus memoization), you could pass the distance accumulated so far as an extra parameter and return early once it exceeds 2. I think this will be inefficient, though, because you will revisit states.

If you use bottom-up DP, you fill the matrix row by row (you only have to keep the previous and the current row). If every entry in the row you just finished is greater than 2, you can terminate early, because later rows can only be larger.

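Modify your source code as marked in the comments:

for (var i = 1; i <= source1Length; i++)
{
    // Smallest value in row i. Column 0 holds i, assuming the usual initialization.
    var rowMin = matrix[i, 0];
    for (var j = 1; j <= source2Length; j++)
    {
        var cost = (source2[j - 1] == source1[i - 1]) ? 0 : 1;

        matrix[i, j] = Math.Min(
            Math.Min(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1),
            matrix[i - 1, j - 1] + cost);
        rowMin = Math.Min(rowMin, matrix[i, j]);
    }
    // modified here:
    // every entry of row i is > 2, so the final distance is > 2 as well.
    // After this break, report "greater than 2" instead of reading
    // matrix[source1Length, source2Length], which was never filled in.
    if (rowMin > 2)
        break;
}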
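
The row-by-row idea above can also be sketched in Python with only two rows in memory. This is a minimal sketch, not the asker's code: the function name `levenshtein_within` and the `limit` parameter are made up for illustration, and returning `None` for "distance exceeds the limit" is just one possible convention.

```python
from typing import Optional


def levenshtein_within(source1: str, source2: str, limit: int) -> Optional[int]:
    """Return the Levenshtein distance if it is <= limit, otherwise None.

    Hypothetical helper: name and return convention are assumptions.
    """
    # Row 0: turning the empty prefix into source2[:j] takes j insertions.
    prev = list(range(len(source2) + 1))
    for i in range(1, len(source1) + 1):
        # Column 0: turning source1[:i] into the empty string takes i deletions.
        cur = [i]
        for j in range(1, len(source2) + 1):
            cost = 0 if source1[i - 1] == source2[j - 1] else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        # Row minima never decrease from one row to the next, so once a
        # whole row is above the limit the final distance must be too.
        if min(cur) > limit:
            return None
        prev = cur
    return prev[-1] if prev[-1] <= limit else None
```

For example, `levenshtein_within("kitten", "sitting", 2)` gives up early and returns `None`, while the same call with `limit=3` returns the actual distance.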



Answer 2:


Yes, you can, and it does reduce the complexity.

The main thing to observe is that levenstein_distance(a,b) >= |len(a) - len(b)|. That is, the distance can't be less than the difference in the lengths of the strings: at the very minimum you need to add that many characters to make them the same length.

Knowing this, you can ignore all the cells in the original matrix where |i-j| > max_distance, because reaching such a cell already costs more than max_distance. So you can modify your loops from

for (i in 0 -> len(a))
   for (j in 0 -> len(b))

to

for (i in 0 -> len(a))
   for (j in max(0, i-max_distance) -> min(len(b), i+max_distance))

You can keep the original matrix if it's easier for you, but you can also save space by using a matrix of size (len(a)+1) x (2*max_distance+1) and adjusting the indices. Either way, any cell outside the band must be treated as larger than max_distance when a cell inside the band reads it as a neighbour.

Once every cost in the row you just computed is greater than max_distance, you can stop the algorithm.

This gives you O(N*max_distance) complexity. Since your max_distance is 2, the complexity is almost linear. You can also bail out at the very start if |len(a)-len(b)| > max_distance.
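
Putting the pieces of this answer together (length check up front, banded inner loop, early exit per row) might look like the following Python sketch. The name `bounded_levenshtein`, the `None` return value, and the choice of two reusable full-length rows instead of the compact band matrix are all assumptions made for illustration; cells just outside the band are set to `max_distance + 1` so they can never win a `min`.

```python
from typing import Optional


def bounded_levenshtein(a: str, b: str, max_distance: int) -> Optional[int]:
    """Return the distance if it is <= max_distance, otherwise None.

    Hypothetical helper: name and return convention are assumptions.
    """
    n, m = len(a), len(b)
    # The distance is at least the length difference, so bail out at once.
    if abs(n - m) > max_distance:
        return None
    big = max_distance + 1  # stands in for "already too far"
    prev = [min(j, big) for j in range(m + 1)]  # row 0, clamped to big
    cur = [big] * (m + 1)
    for i in range(1, n + 1):
        # Only columns with |i - j| <= max_distance are computed.
        lo = max(1, i - max_distance)
        hi = min(m, i + max_distance)
        # Cell just left of the band: column 0 holds i, anything else is
        # outside the band and counts as too far.
        cur[lo - 1] = min(i, big) if lo == 1 else big
        row_min = cur[lo - 1]
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost,   # match or substitution
                         big)                  # clamp
            row_min = min(row_min, cur[j])
        # Cell just right of the band, read by the next row as prev[hi + 1].
        if hi < m:
            cur[hi + 1] = big
        # Every cell in the band exceeds the limit: stop early.
        if row_min > max_distance:
            return None
        prev, cur = cur, prev
    return prev[m] if prev[m] <= max_distance else None
```

Each row touches at most 2*max_distance + 3 cells, so apart from the one-time row setup the work per row does not depend on len(b). Swapping the two rows instead of reallocating them is what keeps that bound; the compact (2*max_distance+1)-wide matrix mentioned above would additionally cut the memory.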



Source: https://stackoverflow.com/questions/48901351/levenstein-distance-limit
