Fuzzy string matching in Python

暗喜 2020-12-23 23:32

I have 2 lists of over a million names each, with slightly different naming conventions. The goal here is to match those records that are similar, with roughly 95% confidence.
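
For scale, here is a minimal sketch of the naive pairwise approach using only the standard library (the names, the 0.95 threshold, and the helper are illustrative, not a recommendation); comparing every pair is O(n*m) and far too slow for two lists of a million names, which is what the answer below addresses:

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Similarity ratio in [0, 1] between two names (case-insensitive)."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Naive O(n*m) matching: fine for small lists, hopeless for two lists
    # of a million names each.
    list_a = ["Jonathan Smith", "Maria Garcia"]   # illustrative data
    list_b = ["Jonathon Smith", "Mary Garcia"]
    matches = [(a, b) for a in list_a for b in list_b
               if similarity(a, b) >= 0.95]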

3 Answers
  •  难免孤独
    2020-12-24 00:03

    You have to index, or normalize, the strings to avoid the O(n^2) run. Basically, you map each string to a normal form and build a reverse dictionary that groups, under each normal form, all the words that share it.

    Let's assume the normal forms of 'world' and 'word' are the same. First map each word to its normal form, e.g.:

        "world" -> Normalized("world")   # some shared key, say "wrld"
        "word"  -> Normalized("word")    # the same key, "wrld"

    then invert that mapping into Normalized -> [word1, word2, word3]:

        Normalized("wrld") -> ["world", "word"]
    

    There you go: every entry in the normalized dictionary whose list holds more than one word is a group of matched words.
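
    A minimal sketch of that indexing step in Python (the normalize function below is just an illustrative placeholder that strips vowels; in practice you would plug in one of the phonetic algorithms listed next):

        from collections import defaultdict

        def normalize(name: str) -> str:
            # Placeholder normal form: lowercase consonants only.
            # Replace with a real phonetic algorithm for production use.
            return "".join(c for c in name.lower() if c.isalpha() and c not in "aeiou")

        def build_index(names):
            """Map each normal form to the list of original names sharing it."""
            index = defaultdict(list)
            for name in names:
                index[normalize(name)].append(name)
            return index

        index = build_index(["Catherine", "Catharine", "Jonathan"])
        # Buckets with more than one entry are the candidate matches:
        print({key: names for key, names in index.items() if len(names) > 1})
        # -> {'cthrn': ['Catherine', 'Catharine']}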

    The normalization algorithm depends on the data, i.e. on the words themselves. Consider one of the many phonetic algorithms (a sketch of plugging one in as the normalizer follows this list):

    • Soundex
    • Metaphone
    • Double Metaphone
    • NYSIIS
    • Caverphone
    • Cologne Phonetic
    • MRA codex
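
    A sketch of using one of these as the normal form, assuming the third-party jellyfish package (pip install jellyfish), which implements Soundex, Metaphone, NYSIIS and the Match Rating Approach codex:

        from collections import defaultdict
        import jellyfish

        def build_index(names, normalize=jellyfish.soundex):
            """Bucket names by their phonetic code."""
            index = defaultdict(list)
            for name in names:
                index[normalize(name)].append(name)
            return index

        index = build_index(["Robert", "Rupert", "Ashcraft"])
        # Soundex gives "Robert" and "Rupert" the same code (R163),
        # so they land in the same bucket of candidate matches.
        print({code: names for code, names in index.items() if len(names) > 1})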
