Explanation of normalized edit distance formula

末鹿安然 提交于 2021-02-08 06:07:25

问题


Based on this paper: IEEE TRANSACTIONS ON PAITERN ANALYSIS : Computation of Normalized Edit Distance and Applications In this paper Normalized Edit Distance as followed:

Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d( X , Y ) is defined as the minimum of W( P ) / L ( P )w, here P is an editing path between X and Y , W ( P ) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (length of P).

enter image description here

Can i safely translate the normalized edit distance algorithm explained above as this:

normalized edit distance = 
levenshtein(query 1, query 2)/max(length(query 1), length(query 2))

回答1:


You are probably misunderstanding the metric. There are two issues:

  1. The normalization step is to divide W(P) which is the weight of the edit procedure over L(P), which is the length of the edit procedure, not over the max length of the strings as you did;

  2. Also, the paper showed that (Example 3.1) normalized edit distance cannot be simply computed with levenshtein distance. You probably need to implement their algorithm.

An explanation of Example 3.1 (c):

From aaab to abbb, the paper used the following transformations:

  1. match a with a;
  2. skip a in the first string;
  3. skip a in the first string;
  4. skip b in the second string;
  5. skip b in the second string;
  6. match the final bs.

These are 6 operations which is why L(P) is 6; from the matrix in (a), matching has cost 0, skipping has cost 2, thus we have total cost of 0 + 2 + 2 + 2 + 2 + 0 = 8, which is exactly W(P), and W(P) / L(P) = 1.33. Similar results can be obtained for (b), which I'll left to you as exercise :-)




回答2:


The 3 in figure 2(a) refers to the cost of changing "a" to "b" or the cost of changing "b" to "a". The columns with lambdas in figure 2(a) mean that it costs 2 in order to insert or delete either an "a" or a "b".

In figure 2(b), W(P) = 6 because the algorithm does the following steps:

  1. keep first a (cost 0)
  2. convert first b to a (cost 3)
  3. convert second b to a (cost 3)
  4. keep last b (cost 0)

The sum of the costs of the steps is W(P). The number of steps is 4 which is L(P).

In figure 2(c), the steps are different:

  1. keep first a (cost 0)
  2. delete first b (cost 2)
  3. delete second b (cost 2)
  4. insert a (cost 2)
  5. insert a (cost 2)
  6. keep last b (cost 0)

In this path there are six steps so the L(P) is 6. The sum of the costs of the steps is 8 so W(P) is 8. Therefore the normalized edit distance is 8/6 = 4/3 which is about 1.33.



来源:https://stackoverflow.com/questions/30787098/explanation-of-normalized-edit-distance-formula

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!