Finding how similar two strings are

萝らか妹 提交于 2019-11-27 03:18:38

Ok, so the standard algorithms are:

1) Hamming distance Only good for strings of the same length, but very efficient. Basically it simply counts the number of distinct characters. Not useful for fuzzy searching of natural language text.

2) Levenstein distance. The Levenstein distance measures distance in terms of the number of "operations" required to transform one string to another. These operations include insertion, deletion and substition. The standard approach of calculating the Levenstein distance is to use dynamic programming.

3) Generalized Levenstein/(Damerau–Levenshtein distance) This distance also takes into consideration transpositions of characters in a word, and is probably the edit distance most suited for fuzzy matching of manually-entered text. The algorithm to compute the distance is a bit more involved than the Levenstein distance (detecting transpositions is not easy). Most common implementations are a modification of the bitap algorithm (like grep).

In general you would probably want to consider an implementation of the third option implemented in some sort of nearest neighbour search based on a k-d tree

the Damerau-Levenshtein distance is similar to the Levenshtein distance, but also includes two-character transposition. the wikipedia page (linked) includes pseudocode that should be fairly trivial to implement.

  • Levenstein distance
  • Hamming distance
  • soundex
  • metaphone

You're looking for the Levenshtein distance

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!