Fuzzy string matching in Python

暗喜 2020-12-23 23:32

I have 2 lists of over a million names each, with slightly different naming conventions. The goal here is to match those records that are similar, with roughly 95% confidence.
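
For scale, here is a minimal sketch of the naive pairwise approach using only the standard library (the names, the 0.95 threshold, and the helper are illustrative, not a recommendation); comparing every pair is O(n*m) and far too slow for two lists of a million names, which is what the answer below addresses:

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Similarity ratio in [0, 1] between two names (case-insensitive)."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Naive O(n*m) matching: fine for small lists, hopeless for two lists
    # of a million names each.
    list_a = ["Jonathan Smith", "Maria Garcia"]   # illustrative data
    list_b = ["Jonathon Smith", "Mary Garcia"]
    matches = [(a, b) for a in list_a for b in list_b
               if similarity(a, b) >= 0.95]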

3 Answers
  •  难免孤独
    2020-12-24 00:03

    You have to index, or normalize, the strings to avoid the O(n^2) run. Basically, you map each string to a normal form and build a reverse dictionary that groups, under each normal form, all the words that share it.

    Let's assume the normal forms of 'world' and 'word' are the same. First map each word to its normal form, e.g.:

        "world" -> Normalized("world")   # some shared key, say "wrld"
        "word"  -> Normalized("word")    # the same key, "wrld"

    then invert that mapping into Normalized -> [word1, word2, word3]:

        Normalized("wrld") -> ["world", "word"]
    

    There you go: every entry in the normalized dictionary whose list holds more than one word is a group of matched words.
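
    A minimal sketch of that indexing step in Python (the normalize function below is just an illustrative placeholder that strips vowels; in practice you would plug in one of the phonetic algorithms listed next):

        from collections import defaultdict

        def normalize(name: str) -> str:
            # Placeholder normal form: lowercase consonants only.
            # Replace with a real phonetic algorithm for production use.
            return "".join(c for c in name.lower() if c.isalpha() and c not in "aeiou")

        def build_index(names):
            """Map each normal form to the list of original names sharing it."""
            index = defaultdict(list)
            for name in names:
                index[normalize(name)].append(name)
            return index

        index = build_index(["Catherine", "Catharine", "Jonathan"])
        # Buckets with more than one entry are the candidate matches:
        print({key: names for key, names in index.items() if len(names) > 1})
        # -> {'cthrn': ['Catherine', 'Catharine']}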

    The normalization algorithm depends on the data, i.e. on the words themselves. Consider one of the many phonetic algorithms (a sketch of plugging one in as the normalizer follows this list):

    • Soundex
    • Metaphone
    • Double Metaphone
    • NYSIIS
    • Caverphone
    • Cologne Phonetic
    • MRA codex
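
    A sketch of using one of these as the normal form, assuming the third-party jellyfish package (pip install jellyfish), which implements Soundex, Metaphone, NYSIIS and the Match Rating Approach codex:

        from collections import defaultdict
        import jellyfish

        def build_index(names, normalize=jellyfish.soundex):
            """Bucket names by their phonetic code."""
            index = defaultdict(list)
            for name in names:
                index[normalize(name)].append(name)
            return index

        index = build_index(["Robert", "Rupert", "Ashcraft"])
        # Soundex gives "Robert" and "Rupert" the same code (R163),
        # so they land in the same bucket of candidate matches.
        print({code: names for code, names in index.items() if len(names) > 1})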
