Percentage rank of matches using Levenshtein Distance matching

既然无缘 2020-12-14 02:19

I am trying to match a single search term against a dictionary of possible matches using a Levenshtein distance algorithm. The algorithm returns a distance expressed as numb

6 Answers
  •  甜味超标
    2020-12-14 02:43

    The maximum possible Levenshtein distance between two strings is [l1, l2].max, where l1 and l2 are their lengths. I think that is true, but we shouldn't divide by it.

    $ gem install levenshtein diff-lcs

    # then, in irb:
    require 'levenshtein'
    require 'diff/lcs'
    
    Diff::LCS.lcs "abc", "qwer"
    => []
    Levenshtein.distance("abc", "qwer").to_f / [3, 4].max
    => 1.0
    
    Diff::LCS.lcs "abc", "cdef"
    => ["c"]
    Levenshtein.distance("abc", "cdef").to_f / [3, 4].max
    => 1.0
    
    Diff::LCS.lcs "1234", "34567890"
    => ["3", "4"]
    Levenshtein.distance("1234", "34567890").to_f / [4, 8].max
    => 1.0
    

    Normalized this way, Levenshtein distance doesn't look like a reliable way to compare strings as a percentage. I don't want strings that share characters to be treated as 100% different.
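
    For anyone who wants to check those numbers without installing the gem, here is a minimal plain-Ruby sketch of the same normalization. The names `levenshtein` and `normalized_similarity` are my own for illustration and are not part of any library; treating two empty strings as identical is also an assumption.

```ruby
# Plain-Ruby Levenshtein distance using two rows of the DP table; no gems needed.
def levenshtein(a, b)
  prev = (0..b.length).to_a            # distance from "" to each prefix of b
  a.each_char.with_index(1) do |ca, i|
    curr = [i]                         # distance from a[0, i] to ""
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,            # deletion
               curr[j - 1] + 1,        # insertion
               prev[j - 1] + cost].min # substitution or match
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: distance divided by the longer length, flipped into a similarity.
def normalized_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero? # assumption: two empty strings are identical
  1.0 - levenshtein(a, b).to_f / longest
end

pairs = [%w[abc qwer], %w[abc cdef], %w[1234 34567890], %w[kitten sitting]]
pairs.each do |a, b|
  puts format("%-8s %-10s distance=%d similarity=%.2f",
              a, b, levenshtein(a, b), normalized_similarity(a, b))
end
```

    The first three pairs all come out with similarity 0.00 even though two of them share characters, while "kitten"/"sitting" (distance 3, longer length 7) comes out at about 0.57.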

    Instead, I'd recommend comparing each sequence against their longest common subsequence (LCS).

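    # Similarity in 0.0..1.0: twice the LCS length over the combined length.
    def get_similarity(sequence_1, sequence_2)
      return 1.0 if sequence_1.empty? && sequence_2.empty? # avoid 0/0 (NaN)
      # Internals.lcs leaves nil at unmatched positions, so compact before counting
      lcs_length = Diff::LCS::Internals.lcs(sequence_1, sequence_2).compact.length
      lcs_length.to_f * 2 / (sequence_1.length + sequence_2.length)
    end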
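
    To tie this back to the question (ranking dictionary entries by percentage), here is a rough gem-free sketch of the same ratio as `get_similarity`, applied to a small list. The helper names `lcs_length` and `similarity_percent`, the search term, and the dictionary are all made up for illustration; scoring two empty strings as 100% is an assumption.

```ruby
# Length of the longest common subsequence, via two rows of the DP table.
def lcs_length(a, b)
  prev = Array.new(b.length + 1, 0)
  a.each_char do |ca|
    curr = [0]
    b.each_char.with_index(1) do |cb, j|
      curr << (ca == cb ? prev[j - 1] + 1 : [prev[j], curr[j - 1]].max)
    end
    prev = curr
  end
  prev.last
end

# 2 * LCS / (len1 + len2), expressed as a percentage rounded to one decimal.
def similarity_percent(a, b)
  total = a.length + b.length
  return 100.0 if total.zero? # assumption: two empty strings are identical
  (200.0 * lcs_length(a, b) / total).round(1)
end

# Hypothetical search term and dictionary, ranked best match first.
term = "1234"
dictionary = %w[34567890 1243 1234 abcd]
ranked = dictionary.map { |word| [word, similarity_percent(term, word)] }
                   .sort_by { |_, score| -score }

ranked.each { |word, score| puts "#{word}: #{score}%" }
```

    With this ratio, "1234" against "34567890" scores 33.3% rather than 0%, because the shared "3", "4" subsequence is counted.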
    
