Percentage rank of matches using Levenshtein Distance matching

既然无缘 2020-12-14 02:19

I am trying to match a single search term against a dictionary of possible matches using a Levenshtein distance algorithm. The algorithm returns a distance expressed as numb

6 Answers

甜味超标 (OP)

2020-12-14 02:43

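The maximum possible Levenshtein distance between two strings is [l1, l2].max, the length of the longer string. I believe that bound is correct, but we shouldn't divide by it to get a percentage, as the examples below show.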

gem install levenshtein diff-lcs

Diff::LCS.lcs "abc", "qwer"
=> []
Levenshtein.distance("abc", "qwer").to_f / [3, 4].max
=> 1.0

Diff::LCS.lcs "abc", "cdef"
=> ["c"]
Levenshtein.distance("abc", "cdef").to_f / [3, 4].max
=> 1.0

Diff::LCS.lcs "1234", "34567890"
=> ["3", "4"]
Levenshtein.distance("1234", "34567890").to_f / [4, 8].max
=> 1.0

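Levenshtein distance doesn't look like a reliable way to compare strings as a percentage. All three pairs above come out as 100% different, even though two of them share characters, and I don't want to treat similar strings as 100% different.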
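To reproduce the arithmetic above without installing the gem, here is a minimal pure-Ruby sketch. The helper names levenshtein_distance and normalized_distance are my own, not part of the levenshtein gem, and returning 0.0 for two empty strings is an assumption.

```ruby
# Dynamic-programming Levenshtein distance
# (insert, delete and substitute each cost 1).
def levenshtein_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1,            # deletion
                  current[j - 1] + 1,         # insertion
                  previous[j - 1] + cost].min # substitution or match
    end
    previous = current
  end
  previous.last
end

# Distance divided by the longer length; 0.0 means identical.
# Assumption: two empty strings count as identical.
def normalized_distance(a, b)
  longest = [a.length, b.length].max
  return 0.0 if longest.zero?
  levenshtein_distance(a, b).to_f / longest
end

puts normalized_distance("abc", "cdef")          # 1.0
puts normalized_distance("1234", "34567890")     # 1.0
```

Both pairs reach the maximum distance because the shared characters sit at opposite ends of the strings, so aligning them costs as much as replacing everything.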

Instead, I would recommend comparing each sequence against their longest common subsequence (LCS).

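require 'diff/lcs'

# Similarity in 0.0..1.0: twice the LCS length over the combined length.
def get_similarity(sequence_1, sequence_2)
  # Internals.lcs leaves nil for unmatched positions, so compact.length is the LCS length.
  lcs_length = Diff::LCS::Internals.lcs(sequence_1, sequence_2).compact.length
  lcs_length.to_f * 2 / (sequence_1.length + sequence_2.length)
end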
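
If you would rather not depend on the gem's internals, the same ratio can be sketched in plain Ruby. The names lcs_length and lcs_similarity are hypothetical helpers of mine, and treating two empty strings as fully similar is an assumption.

```ruby
# Length of the longest common subsequence, using a rolling DP row.
def lcs_length(a, b)
  previous = Array.new(b.length + 1, 0)
  a.each_char do |char_a|
    current = [0]
    b.each_char.with_index(1) do |char_b, j|
      current << if char_a == char_b
                   previous[j - 1] + 1
                 else
                   [previous[j], current[j - 1]].max
                 end
    end
    previous = current
  end
  previous.last
end

# Twice the LCS length divided by the combined length, in 0.0..1.0.
# Assumption: two empty strings are treated as identical (1.0).
def lcs_similarity(a, b)
  total = a.length + b.length
  return 1.0 if total.zero?
  lcs_length(a, b) * 2.0 / total
end

puts lcs_similarity("abc", "qwer")                   # 0.0
puts lcs_similarity("abc", "cdef").round(3)          # 0.286
puts lcs_similarity("1234", "34567890").round(3)     # 0.333
```

Unlike the normalized Levenshtein distance, this gives the three example pairs different scores, so strings that share characters no longer look completely different.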