How to determine whether records from different sources represent the same person

Submitted by 与世无争的帅哥 on 2019-12-04 13:18:13

The crux of the problem is to compute one or more measures of distance between each pair of entries and to treat two entries as the same person when at least one of those distances falls below an acceptable threshold. The key is to set up the analysis and then vary the threshold until you reach what you consider the best trade-off between false positives and false negatives.

One distance measure could be phonetic. Another you might consider is the Levenshtein (edit) distance between the entries, which attempts to measure typos.
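As a rough illustration of the edit-distance idea, here is a minimal sketch in plain Python. The function names and the default threshold of 2 are assumptions made for the example, not values from the answer; you would tune the threshold against your own data.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character inserts, deletes,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def looks_like_same_person(name_a: str, name_b: str, max_distance: int = 2) -> bool:
    """Hypothetical matching rule: names match when their edit distance,
    ignoring case and surrounding whitespace, is at most max_distance."""
    a = name_a.strip().lower()
    b = name_b.strip().lower()
    return levenshtein(a, b) <= max_distance
```

With this rule, "Jon Smith" and "John Smith" are one edit apart and would be treated as the same person, while unrelated names fall well outside the threshold.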

If you have a reasonable idea of how many persons you should have, then your goal is to find the sweet spot where you are getting about the right number of persons. Make your matching too fuzzy and you'll have too few. Make it too restrictive and you'll have too many.
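One way to look for that sweet spot is to group the records at several thresholds and watch how the number of distinct persons changes. The sketch below is only an illustration: it uses `difflib.SequenceMatcher` from the standard library as a stand-in distance (1 minus the similarity ratio), merges any two names whose distance is below the threshold, and counts the resulting groups. The function names and sample names are made up for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a: str, b: str) -> float:
    """Stand-in distance in [0, 1]: 0 means identical, 1 means nothing in common."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def count_persons(names: list[str], threshold: float) -> int:
    """Merge every pair of names closer than threshold (union-find)
    and return the number of resulting groups."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if distance(names[i], names[j]) < threshold:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(names))})


if __name__ == "__main__":
    sample = ["Jon Smith", "John Smith", "Maria Garcia", "Maria Gracia", "Wei Chen"]
    for threshold in (0.0, 0.1, 0.2, 0.5, 1.01):
        print(threshold, count_persons(sample, threshold))
```

A threshold of 0 leaves every record as its own person, and a threshold above 1 collapses everything into one, so the useful values lie somewhere in between. Comparing every pair is quadratic in the number of records, so for large data sets you would likely restrict comparisons to candidate pairs first.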

If you know roughly how many entries a person should have, then you can use that as the metric to see when you are getting close. Or you can divide the total number of records by the average number of records per person to get a rough number of persons that you're shooting for.
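The arithmetic is simple, but spelling it out can help when comparing runs. The sketch below assumes you already have a mapping from each threshold you tried to the number of persons it produced; all names and figures are hypothetical.

```python
def expected_person_count(total_records: int, avg_records_per_person: float) -> int:
    """Rough target: total records divided by the average records per person."""
    if avg_records_per_person <= 0:
        raise ValueError("avg_records_per_person must be positive")
    return round(total_records / avg_records_per_person)


def closest_threshold(persons_by_threshold: dict[float, int], target: int) -> float:
    """Return the tried threshold whose person count is nearest the target."""
    return min(persons_by_threshold, key=lambda t: abs(persons_by_threshold[t] - target))


if __name__ == "__main__":
    # Hypothetical figures: 12,000 records, about 3 records per person.
    target = expected_person_count(12_000, 3)
    # Hypothetical results from earlier matching runs.
    runs = {0.05: 9_500, 0.10: 5_200, 0.15: 4_100, 0.20: 2_800}
    print(target, closest_threshold(runs, target))
```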

If you don't have any numbers to use, then you're left picking out groups of records from your analysis and checking by hand whether they look like the same person or not. So it's guess and check.

I hope that helps.

This sounds like a Customer Data Integration problem. Search on that term and you might find some more information. Also, have a poke around The Data Warehousing Institute, and you might find some answers there as well.

Edit: In addition, here's an article that might interest you on Spanish phonetic matching.

I've had to do something similar before, and what I did was use a Double Metaphone phonetic search on the names.

Before I compared the names, though, I tried to normalize away any name/nickname differences by looking up each name in a nickname table I had created. (I populated the table with census data I found online.) So people called Bob became Robert, Alex became Alexander, Bill became William, etc.
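A nickname lookup of this kind might look like the sketch below. The table here holds only the three examples mentioned above; a real one would be far larger and loaded from a database or file. The function and variable names are invented for illustration.

```python
# Tiny illustrative nickname table; keys are lowercase nicknames.
NICKNAMES = {
    "bob": "Robert",
    "alex": "Alexander",
    "bill": "William",
}


def normalize_first_name(name: str) -> str:
    """Replace a known nickname with its formal name; otherwise
    return the name trimmed and unchanged."""
    cleaned = name.strip()
    return NICKNAMES.get(cleaned.lower(), cleaned)
```

Running this step before the phonetic or edit-distance comparison means "Bob Jones" and "Robert Jones" are compared as the same first name rather than as two very different strings.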

Edit: Double Metaphone was specifically designed to be better than Soundex and work in languages other than English.

If you are using SSIS, try the Fuzzy Lookup transformation.

Just to add some details for solving this issue: I found these modules for PostgreSQL 8.3.

You might try to canonicalise the names by comparing them with a dictionary.
This would allow you to spot some common typos and correct them.
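As a minimal sketch of that idea, the standard library's `difflib.get_close_matches` can map a possibly misspelled name onto the closest entry in a dictionary of known names. The dictionary contents, the function name, and the 0.7 cutoff are assumptions chosen for the example.

```python
from difflib import get_close_matches

# Hypothetical dictionary of canonical names.
KNOWN_NAMES = ["John", "Maria", "William"]


def canonicalise(name: str, known_names: list[str] = KNOWN_NAMES,
                 cutoff: float = 0.7) -> str:
    """Return the closest known name if it is similar enough,
    otherwise return the input unchanged."""
    by_lower = {known.lower(): known for known in known_names}
    matches = get_close_matches(name.strip().lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else name
```

A lower cutoff corrects more typos but also increases the chance of mapping a genuinely different name onto a dictionary entry, so it is worth checking a sample of the corrections by hand.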

It sounds to me like you have a record linkage problem. You can use the references in the link.
