Percentage rank of matches using Levenshtein Distance matching

后端未结

关注

 6  1849

I am trying to match a single search term against a dictionary of possible matches using a Levenshtein distance algorithm. The algorithm returns a distance expressed as numb

相关标签:

6条回答

臣服心动

2020-12-14 02:25
I had a similar problem and this thread helped me to figure out a solution. Hope it can help others too.
```
int levDis = Lev_distance(Q, Mi)
int bigger = max(strlen(Q), strlen(Mi))
double pct = (bigger - levDis) / bigger
```
It should return 100% if both strings are exactly the same and 0% if they are totaly different.

(sorry if my english isn't that good)
0 讨论(0)
发布评论:

提交评论
- 加载中...
孤独总比滥情好

2020-12-14 02:27
```
(1 - (levNum / Math.max(s.length,t.length) ) ) *100
```
should be correct
0 讨论(0)
发布评论:

提交评论
- 加载中...
执念已碎

2020-12-14 02:42
My approach to this problem was by calculating maximum allowed operations, which is what Levenshtein distance is. The formula I used is:
```
percent = 0.75; // at least 75% of string must match
maxOperationsFirst = s1.length() - s1.length() * percent;
maxOperationsSecond = s2.length() - s2.length() * percent;
maxOperations = round(min(maxOperationsFirst, maxOperationsSecond));
```
It calculates maximum operations for each string, I believe that the calculation is easy to understand. I use the minimum value of both results and round it to closest whole number. You can skip this part and use just value of max operations from either of strings, it really depends on your data.

Once you've got the number of maximum operations, you can compare it with levenshtein result and determine if the string is acceptable. This way you can use any extended levenshtein methods, for example Damerau–Levenshtein distance, which count misspelling, e.g. test -> tset, only as 1 operation, which is quite useful when checking user input where those misspellings occur very often.

I hope this helps you get an idea on how to solve this problem.
0 讨论(0)
发布评论:

提交评论
- 加载中...

甜味超标

2020-12-14 02:43

Maximum number of levenshtein distance is [l1, l2].max. I think it is true. But we shouldn't divide by it.

gem install levenshtein diff-lcs

Diff::LCS.lcs "abc", "qwer"
=> []
Levenshtein.distance("abc", "qwer").to_f / [3, 4].max
=> 1.0

Diff::LCS.lcs "abc", "cdef"
=> ["c"]
Levenshtein.distance("abc", "cdef").to_f / [3, 4].max
=> 1.0

Diff::LCS.lcs "1234", "34567890"
=> ["3", "4"]
Levenshtein.distance("1234", "34567890").to_f / [4, 8].max
=> 1.0

Levenshtein doesn't look like reliable way to compare strings in percents. I don't want to treat similar strings as 100% different.

I can recommend just to analyze diff between each sequence and LCS.

def get_similarity(sequence_1, sequence_2)
  lcs_length = Diff::LCS::Internals.lcs(sequence_1, sequence_2).compact.length
  lcs_length.to_f * 2 / (sequence_1.length + sequence_2.length)
end

0 讨论(0)

别跟我提以往

2020-12-14 02:44
What about this one:
```
100 - ( ((2*Lev_distance(Q, Mi)) / (Q.length + Mi.length)) * 100 )
```
It gives same distance on (Q, M1) and (Q,M2)
0 讨论(0)
发布评论:

提交评论
- 加载中...
遇见更好的自我

2020-12-14 02:50
This is essentially option 2 mentioned in my question. However let me demonstrate a problem with that approach.
```
Q = "ABC Corp" (len = 8)
M1 = "ABC"
M2 = "ABC Corporati"
M3 = "ABC Corp"
```
I have chosen M1 and M2 such that their Lev distances are same (5 each). Using option 2, the match percentages would be
```
M1 = (1 - 5/8)*100  = 37.5%
M2 = (1 - 5/13)*100 = 61.5%
M3 = 100%
```
As you can see if I present the matches in that order, there is a huge rank difference between M1 and M2, even though they have the exact same lev distance. You see the problem?
0 讨论(0)
发布评论:

提交评论
- 加载中...