Efficient string similarity grouping


滥情空心 2020-11-30 11:17

Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names).

 pdata<-dat


      
      
        
9 answers

        
                    
            
            
                         
                
              
              
                
                   独厮守ぢ
                                             
                
                
                (original poster)
            
              
              
                2020-11-30 12:01
              

            
            
                        
To cut down the number of comparisons in this sort of name matching, I use a function that counts the syllables in the name (the surname). I store that count in the database as a pre-processed value, so it acts as a syllable hash.

You can then group together words that have the same number of syllables. (My own algorithms allow a difference of 1 or 2 syllables, since legitimate spelling variants and typos can change the count, but my research has found that 95% of misspellings share the same number of syllables.)

For example, Peter and Pieter have the same syllable count (2), so they fall into the same group, whereas Jones and Smith each have 1 and fall into a different group.
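
The grouping described above might look roughly like the following Python sketch. It is not the C function mentioned in this answer: `syllable_hash` and `group_by_syllables` are hypothetical names, and the counting rule (runs of vowels, minus a likely silent trailing "e"/"es") is an assumed heuristic that only handles ASCII letters.

```python
import re
from collections import defaultdict

VOWEL_RUN = re.compile(r"[aeiouy]+")

def syllable_hash(name):
    """Approximate syllable count (assumed heuristic, ASCII letters only)."""
    word = re.sub(r"[^a-z]", "", name.lower())
    # Each run of consecutive vowels counts as one syllable.
    count = len(VOWEL_RUN.findall(word))
    # A trailing consonant + "e" or "es" is treated as silent (e.g. "Jones").
    if count > 1 and re.search(r"[^aeiouy]es?$", word):
        count -= 1
    return count

def group_by_syllables(names):
    """Bucket names by their syllable hash."""
    groups = defaultdict(list)
    for name in names:
        groups[syllable_hash(name)].append(name)
    return dict(groups)

groups = group_by_syllables(["Peter", "Pieter", "Jones", "Smith"])
```

Expensive string-similarity comparisons then only need to run inside each bucket rather than across the whole data set.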

If your function does not return 1 syllable for Jones, you may need to raise your tolerance to allow a difference of at least 1 syllable when grouping by the syllable hash. That compensates for incorrect counts from the function and still catches the matching surname in the grouping.
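
A tolerance of this kind could be sketched as below. The function name `candidate_pairs`, the `tolerance` parameter, and the sample counts are all assumptions made for illustration. The input is a mapping from name to an already-computed syllable count, and "Jones" is deliberately given a wrong count of 2 to mimic an imperfect counter.

```python
from collections import defaultdict
from itertools import combinations, product

def candidate_pairs(hashes, tolerance=1):
    """Return name pairs whose syllable counts differ by at most `tolerance`."""
    buckets = defaultdict(list)
    for name, count in hashes.items():
        buckets[count].append(name)

    pairs = []
    for count in sorted(buckets):
        # Pairs within the same bucket.
        pairs.extend(combinations(buckets[count], 2))
        # Pairs with buckets up to `tolerance` syllables higher.
        for delta in range(1, tolerance + 1):
            pairs.extend(product(buckets[count], buckets.get(count + delta, [])))
    return pairs

# Hypothetical pre-computed counts; "Jones" is miscounted as 2 on purpose.
hashes = {"Jones": 2, "Jonas": 2, "Johns": 1, "Smith": 1, "Montgomery": 4}
strict = candidate_pairs(hashes, tolerance=0)
loose = candidate_pairs(hashes, tolerance=1)
```

With `tolerance=0` the pair ("Johns", "Jones") is never compared. With `tolerance=1` it is, at the cost of a few extra candidate pairs, while "Montgomery" is still excluded from every comparison.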

My syllable-counting function may not carry over directly, since you might need to cope with non-English letter sets (so I have not pasted the code; it's in C anyway). Mind you, the syllable count function does not have to be accurate in terms of the true syllable count; it only needs to act as a reliable hashing function, which it does. I find it far superior to Soundex, which relies on the first letter being accurate.

Give it a go; you might be surprised how much improvement you get from a syllable hash function. You may have to ask on Stack Overflow for help porting the function to your language.
    
             
                                                        
            
            
              
                