Super fuzzy name checking?

你离开我真会死。 提交于 2019-12-02 17:46:28

I implemented such a functionality on one website. I use double_metaphone() + levenstein() in PHP. I precalculate a double_metaphone() for each entry in the dabatase, which I lookup using a SELECT of the first x chars of the 'metaphoned' searched term.

Then I sort the returned result according to their levenstein distance. double_metaphone() is not part of any PHP library (last time I checked), so I borrowed a PHP implementation I found somewhere a long while ago on the net (site no longer on line). I should post it somewhere I suppose.

EDIT: The website is still in archive.org: http://web.archive.org/web/20080728063208/http://swoodbridge.com/DoubleMetaPhone/

or Google cache: http://webcache.googleusercontent.com/search?q=cache:Tr9taWl9hMIJ:swoodbridge.com/DoubleMetaPhone/+Stephen+Woodbridge+double_metaphon

which leads to many other useful links with source code for double_metaphone(), including one in Javascript on github: http://github.com/maritz/js-double-metaphone

EDIT: Went through my old code, and here are roughly the steps of what I do, pseudo coded to keep it clear:

1) Precompute a double_metaphone() for every word in the database, i.e., $word='blahblah'; $soundslike=double_metaphone($word);

2) At lookup time, $word is fuzzy-searched against the database: $soundslike = double_metaphone($word)

4) SELECT * FROM table WHERE soundlike LIKE $soundlike (if you have levenstein stored as a procedure, much better: SELECT * FROM table WHERE levenstein(soundlike,$soundlike) < mythreshold ORDER BY levenstein(word,$word) ASC LIMIT ... etc.

It has worked well for me, although I can't use a stored procedure, since I have no control over the server and it's using MySQL 4.20 or something.

stimms

I asked a similar question once. Name Hypocorism List I never did get around to doing anything with it but the problem has come up again at work so I might write and open source a library in .net for doing some matching.

Update: I ported the perl module I mentioned there to C# and put it up on github. http://github.com/stimms/Nicknames

Implement the Levenshtein distance:

http://en.wikipedia.org/wiki/Levenshtein_distance

This can be written as a SQL Function and queried many different ways.

Well SSIS has some fuzzy logic tasks we use to find duplicates after the fact.

I think though you need to have your logic look at more than just the name for best results. If they are putting in address, email or phone information, perhaps you could look for people with the same last name with one or more of those other matches and ask if one of them will do. You could also make a table of nicknames for various names and match on that. You won't get all of them, but you could get some of the most common in your country at least.

You can use SOUNDEX to get similar sounding names. However, it won't match with William and Bill for example.

Try this in SQL as an example.

SELECT SOUNDEX('John'), SOUNDEX('Jon')

There is some built-in SOUNDS LIKE functionality in SQL Server, see SOUNDEX http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx

As for full / nickname searching there isn't anything built it that I am aware of. Nicknames vary by region and it's a lot of information to keep track of. There might be a database linking full names to nicknames that you could leverage in your own application.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!