Comparing similarity between multiple strings with a random starting point

Submitted by ↘锁芯ラ on 2019-12-13 05:17:02

Question


I have a bunch of people's names that are tied to their respective identifying numbers (e.g. Social Security Number, National ID, or Passport Number). Due to duplication, though, one identity number can have up to 100 names, which could be similar or totally different. For example, ID 221 could have the names Richard Parker, Mary Parker, Aunt May, Parker Richard, M@rrrrryy Richard, and so on. Some are typos, but some are totally different names.

Initially, I want to display only 3 (or a similarly small number) of the names that are as different as possible from the rest, to alert the viewer that the multiple names might not be typos but could be a case of identity theft, negligent data capture, or something else.

I've read up on algorithms for detecting similarity and am currently looking at this one, which computes a score: a score of 1 means the two strings are the same, while a lower score means they are dissimilar. In my use case, how can I go through, say, the 100 names and display the 3 that are most dissimilar? The algorithm for that escapes me: I feel like I need a starting point, then have to compare it against all the others, then loop again, and so on.


Answer 1:


Take the function from https://stackoverflow.com/a/14631287/1082673, as you mentioned, and iterate over all pairwise combinations in your list. This works as long as you don't have too many entries; the number of pairs grows quadratically (100 names already give 4,950 comparisons), so the computation time can increase quickly.

Here is how to generate the pairs for a given list:

import itertools

persons = ['person1', 'person2', 'person3']

# Visit every unordered pair exactly once.
for p1, p2 in itertools.combinations(persons, 2):
    print("Compare", p1, "and", p2)
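
One way to turn the pairs into a ranking without needing a starting point is to average each name's similarity to all the other names and show the names with the lowest averages. The sketch below is only an illustration under some assumptions: `similarity` uses `difflib.SequenceMatcher` from the standard library as a stand-in for the scoring function linked above (it also returns 1.0 for identical strings), and the helper name `most_dissimilar` and its parameter `k` are hypothetical.

```python
import itertools
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in score in [0, 1]; 1.0 means the strings are identical.

    Assumption: swap in the function from the linked answer if preferred.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_dissimilar(names, k=3):
    """Return the k names with the lowest average similarity to the others."""
    if len(names) < 2:
        return list(names)
    totals = [0.0] * len(names)
    # Score every unordered pair once and credit both members of the pair.
    for (i, a), (j, b) in itertools.combinations(enumerate(names), 2):
        score = similarity(a, b)
        totals[i] += score
        totals[j] += score
    averages = [total / (len(names) - 1) for total in totals]
    # Lowest average similarity first, i.e. the outliers come first.
    ranked = sorted(range(len(names)), key=lambda idx: averages[idx])
    return [names[idx] for idx in ranked[:k]]


names = ["Richard Parker", "Mary Parker", "Aunt May",
         "Parker Richard", "M@rrrrryy Richard"]
print(most_dissimilar(names))
```

Because every name is compared against every other name, the result does not depend on which name you start from. Note that a character-based score may still rate reordered names such as "Parker Richard" as fairly different from "Richard Parker", so sorting the words of each name before scoring might be worth trying.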


Source: https://stackoverflow.com/questions/18689026/comparing-similarity-between-multiple-strings-with-a-random-starting-point
