How can I create a threshold for similar strings using Levenshtein distance and account for typos?

杀马特。学长 韩版系。学妹 提交于 2019-12-04 11:29:59

First off, Levenshtein distance is defined as the minimum number of edits required to transform string A to string B, where an edit is the insertion, or deletion of a single character, or the replacement of a character with another character. So it's very much the "difference between two strings", for a certain definition of distance. =)

It sounds like you're looking for a distance function F(A, B) that gives a distance between strings A and B and a threshold N where strings with distance less than N from each other are candidates for typos. In addition to Levenshtein distance you might also consider Needleman–Wunsch. It's basically the same thing but it lets you provide a function for how close a given character is to another character. You could use that algorithm with a set of weights that reflect the positions of keys on a QWERTY keyboard to do a pretty good job of finding typos. This would have issues with international keyboards though.

If you have k strings and you want to find potential typos, the number of comparisons you need to make is O(k^2). In addition, each comparison is O(len(A)*len(B)). So if you have a million strings you're going to find yourself in trouble if you do things naively. Here are a few suggestions on how to speed things up:

  • Apologies if this is obvious, but Levenshtein distance is symmetrical, so make sure you aren't computing F(A, B) and F(B, A).
  • abs(len(A) - len(B)) is a lower bound on the distance between strings A and B. So you can skip checking strings whose lengths are too different.

One issue you might run into is that "1st St." has a pretty high distance from "First Street", even though you probably want to consider those to be identical. The easiest way to handle this is probably to transform strings into a canonical form before doing the comparisons. So you might make all strings lowercase, use a dictionary that maps "1st" to "first", etc. That dictionary might get pretty big, but I don't know a better way to deal with this issues.

Since you tagged this question with php, I'm assuming you want to use php for this. PHP has a built-in levenshtein() function but both strings have to be 255 characters or less. If that's not long enough you'll have to make your own. Alternatively, you investigate using Python's difflib.

You should check out this book:

http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Has a good chapter (3.3) on spell checking

The references at the end of the chapter lists some papers that discuss probabilistic models

Good luck

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!