Efficient string similarity grouping

滥情空心 2020-11-30 11:17

Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names).

 pdata<-dat         


        
9 answers
  •  独厮守ぢ
    2020-11-30 12:01

    To reduce the number of comparisons in this sort of name matching, I create a function that counts the syllables in the name (here, the surname) and store the result in the database as a pre-processed value. This count then serves as a Syllable Hash function.
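
As a rough illustration of what such a function could look like, here is a minimal Python sketch. It is not the answerer's C code, which was not posted. The name `syllable_hash` and the vowel-run heuristic are assumptions: the function simply counts runs of consecutive vowels, so it over-counts names with a silent "e" such as Jones, and it only handles unaccented Latin letters.

```python
import re

def syllable_hash(name: str) -> int:
    """Approximate syllable count: the number of runs of consecutive vowels.

    Hypothetical helper for illustration only. It is a heuristic, not a
    true syllable counter, and it treats 'y' as a vowel.
    """
    vowel_runs = re.findall(r"[aeiouy]+", name.lower())
    # Return at least 1 so that vowel-less or empty input still gets a bucket.
    return max(1, len(vowel_runs))

for surname in ["Peter", "Pieter", "Smith", "Jones"]:
    print(surname, syllable_hash(surname))
```

With this heuristic Peter and Pieter both map to 2 and Smith maps to 1, while Jones maps to 2 rather than 1 because of the silent "e".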

    You can then group together words that have the same number of syllables. (I use algorithms that allow a difference of 1 or 2 syllables, which can arise from legitimate spelling variations or typos, but my research has found that 95% of misspellings share the same number of syllables.)

    For example, Peter and Pieter have the same syllable count (2) and so fall into the same group, whereas Jones and Smith each have 1 syllable and so do not fall into that group.

    If your function does not return 1 syllable for Jones, you may need to widen your tolerance to allow a difference of at least 1 syllable in the Syllable Hash grouping. This compensates for incorrect results from the syllable function and still places the matching surname in the right group.
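
The grouping with a tolerance described above might be sketched as follows in Python. This is an assumption-laden illustration, not the answerer's implementation: `syllable_hash` is the same hypothetical vowel-run heuristic, and `candidate_pairs` is a made-up helper that returns only the name pairs whose syllable counts differ by at most `tolerance`.

```python
import re
from collections import defaultdict
from itertools import combinations

def syllable_hash(name: str) -> int:
    """Hypothetical heuristic: count runs of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", name.lower())))

def candidate_pairs(names, tolerance=1):
    """Return the pairs of names whose syllable hashes differ by <= tolerance."""
    buckets = defaultdict(list)
    for name in names:
        buckets[syllable_hash(name)].append(name)

    pairs = set()
    for count, members in buckets.items():
        # Pairs inside the same bucket.
        for a, b in combinations(members, 2):
            pairs.add(tuple(sorted((a, b))))
        # Pairs with buckets up to `tolerance` syllables above this one.
        # .get() is used so that missing buckets are not created mid-loop.
        for delta in range(1, tolerance + 1):
            for a in members:
                for b in buckets.get(count + delta, []):
                    pairs.add(tuple(sorted((a, b))))
    return pairs

names = ["Peter", "Pieter", "Smith", "Jones", "Alexander"]
print(sorted(candidate_pairs(names, tolerance=0)))
print(sorted(candidate_pairs(names, tolerance=1)))
```

Only the surviving pairs would then be passed to whatever string-distance measure is already in use, so the expensive comparison runs on a fraction of all possible pairs. In this toy list, Alexander (4 vowel runs) is never compared with the shorter names.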

    My syllable-counting function may not apply directly, since you might need to cope with non-English letter sets, so I have not pasted the code (it's in C anyway). Mind you, the syllable count function does not have to be accurate in terms of the TRUE syllable count; it simply needs to act as a reliable hashing function, which it does. It is far superior to SoundEx, which relies on the first letter being accurate.

    Give it a go; you might be surprised how much improvement you get from implementing a Syllable Hash function. You may have to ask SO for help porting the function to your language.
