Efficient string similarity grouping

滥情空心 2020-11-30 11:17

Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names).

 pdata<-dat         


        
9 answers
  •  不知归路
    2020-11-30 12:12

    I faced the same performance issue a couple of years ago. I had to find duplicate people based on their typed names. My dataset had 200k names, and the full pairwise distance matrix approach exploded. After a few days of searching for a better method, the one I'm proposing here did the job for me in a few minutes:

    library(stringdist)
    
    parents_name <- c("peter pan + marta steward",
                "pieter pan + marta steward",
                "armin dolgner + jane johanna dough", 
                "jack jackson + sombody else")
    
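    person_id <- seq_along(parents_name)
    
    # 0 means "not yet assigned to a family"
    family_id <- vector("integer", length(parents_name))
    
    
    # Loop until every person has a family id
    while(any(family_id == 0)){
    
      # People still waiting for a family id
      ids <- person_id[family_id == 0]
    
      # Levenshtein distance from the first unassigned name to all unassigned names
      dists <- stringdist(parents_name[family_id == 0][1], 
                          parents_name[family_id == 0], 
                          method = "lv")
    
      # Everyone within 3 edits (including the reference name itself) joins one family
      matches <- ids[dists <= 3]
    
      family_id[matches] <- max(family_id) + 1L
    }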
    
    result <- data.frame(person_id, parents_name, family_id)
    

    This way, the while loop compares against fewer names on every iteration, because names that already have a family id are dropped from the comparison. From there, you can add further speed-ups, such as only comparing names that start with the same first letter.
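
The first-letter filter mentioned above could look roughly like the sketch below. It is one possible reading of that idea, not tested on real data: the function name `group_by_block`, the `max_dist` argument, and the choice of the first character as the blocking key are all assumptions made for illustration. To keep it runnable without extra packages it uses base R's `adist()` (generalized Levenshtein distance) in place of `stringdist(..., method = "lv")`. It also assumes there are no missing names.

```r
# Sketch: the same greedy grouping, run separately inside first-letter blocks.
# group_by_block and max_dist are hypothetical names, not part of any package.
group_by_block <- function(parents_name, max_dist = 3) {
  # 0 means "not yet assigned to a family"
  family_id <- integer(length(parents_name))

  # Assumed blocking key: the first character of each name string
  block <- substr(parents_name, 1, 1)

  # Row indices grouped by blocking key; names are only compared within a block
  for (idx in split(seq_along(parents_name), block)) {
    while (any(family_id[idx] == 0)) {
      # Rows in this block that still need a family id
      open <- idx[family_id[idx] == 0]

      # Levenshtein distance from the first open name to all open names
      dists <- as.vector(adist(parents_name[open[1]], parents_name[open]))

      # The reference name has distance 0 to itself, so every pass assigns at least one row
      family_id[open[dists <= max_dist]] <- max(family_id) + 1L
    }
  }
  family_id
}

parents_name <- c("peter pan + marta steward",
                  "pieter pan + marta steward",
                  "armin dolgner + jane johanna dough",
                  "jack jackson + sombody else")

family_id <- group_by_block(parents_name)
result <- data.frame(person_id = seq_along(parents_name), parents_name, family_id)
```

The trade-off is that a typo in the very first character puts two siblings into different blocks, so they are never compared. A more forgiving key would reduce that risk at the cost of larger blocks.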
