Efficient string similarity grouping

滥情空心 2020-11-30 11:17

Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names).

 pdata<-dat


      
      
        
9 answers

不知归路 (original poster) 2020-11-30 12:12
                        
I faced the same performance issue a couple of years ago. I had to find duplicate people based on their typed names. My dataset had 200k names, and the full distance-matrix approach exploded. After searching for some days for a better method, the one I'm proposing here did the job for me in a few minutes:

library(stringdist)

parents_name <- c("peter pan + marta steward",
                  "pieter pan + marta steward",
                  "armin dolgner + jane johanna dough",
                  "jack jackson + sombody else")

person_id <- seq_along(parents_name)

# 0 means "no family assigned yet"
family_id <- vector("integer", length(parents_name))

# Loop until every person has a family id
while (any(family_id == 0)) {

  # Positions of the people that are still unassigned
  unassigned <- which(family_id == 0)

  # Levenshtein distance from the first unassigned name to all unassigned names
  dists <- stringdist(parents_name[unassigned[1]],
                      parents_name[unassigned],
                      method = "lv")

  # Names within 3 edits of the reference name are treated as one family
  matches <- unassigned[dists <= 3]

  family_id[matches] <- max(family_id) + 1
}

result <- data.frame(person_id, parents_name, family_id)


That way, the while loop compares fewer names on every iteration, because every name that has already been assigned to a family drops out of the comparison. From there, you could add further performance boosters, such as only comparing names that share the same first letter.
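
The first-letter filter mentioned above could look roughly like the sketch below. It is only an illustration, not something I benchmarked: the helper name group_by_block and its max_dist argument are made up for this example, and the blocking key (the first character of the combined parents string) is an assumption. The idea is to run the same greedy loop separately inside each block, so names are only compared with other names in the same block.

```r
library(stringdist)

parents_name <- c("peter pan + marta steward",
                  "pieter pan + marta steward",
                  "armin dolgner + jane johanna dough",
                  "jack jackson + sombody else")

# Hypothetical helper: greedy grouping, restricted to names sharing a first letter
group_by_block <- function(names, max_dist = 3) {
  family_id <- integer(length(names))   # 0 means "not assigned yet"
  block <- substr(names, 1, 1)          # blocking key: first character

  for (b in unique(block)) {
    in_block <- which(block == b)

    # Same loop as before, but only over the members of this block
    while (any(family_id[in_block] == 0)) {
      unassigned <- in_block[family_id[in_block] == 0]
      dists <- stringdist(names[unassigned[1]],
                          names[unassigned],
                          method = "lv")
      matches <- unassigned[dists <= max_dist]
      family_id[matches] <- max(family_id) + 1L
    }
  }
  family_id
}

family_id <- group_by_block(parents_name)
```

On the four example names this gives the same grouping as the loop above. The trade-off is that two spellings that differ in their very first character land in different blocks and are never compared, so this is only worth it if typos in the first letter are rare in your data.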
    
             
                                                        
            
            
              
                