Efficient string similarity grouping

滥情空心 2020-11-30 11:17

Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names).

 pdata<-dat         


        
9 Answers
  •  广开言路
    2020-11-30 11:57

    If I understand correctly, you want to compare every parent pair (every row of the parents_name column) with all other pairs (rows), and keep the rows whose Levenshtein distance is less than or equal to 2.

    I have written the following code as a starting point:

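    library(stringdist)  # provides stringdist()

    pdata <- data.frame(parents_name = c("peter pan + marta steward",
                                         "pieter pan + marta steward",
                                         "armin dolgner + jane johanna dough",
                                         "jack jackson + sombody else"),
                        stringsAsFactors = FALSE)

    # For each row, compute its Levenshtein distance to every row
    # (method = "lv"; the package default is "osa") and keep matches within 2 edits.
    fuzzy_match <- vector("list", nrow(pdata))
    system.time(for (i in seq_len(nrow(pdata))) {
      d <- stringdist(pdata$parents_name[i], pdata$parents_name, method = "lv")
      fuzzy_match[[i]] <- cbind(pdata, parents_name_2 = pdata$parents_name[i],
                                dist = as.integer(d))
      fuzzy_match[[i]] <- fuzzy_match[[i]][fuzzy_match[[i]]$dist <= 2, ]
    })
    # Stack the per-row matches into one data frame of (name, matched name, distance).
    fuzzy_final <- do.call(rbind, fuzzy_match)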
    

    Does it return what you wanted?
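
    If the goal is one group label per row rather than a table of matching pairs, one possible sketch is to cluster the pairwise distances. The version below is only an illustration under a few assumptions: it swaps in base R's adist() (generalized Levenshtein distance) so that it runs without extra packages, the column name sibling_group is made up for the example, and single-linkage clustering is my choice, not something the question asks for. It still computes all pairwise distances, so it does not by itself solve the scaling problem for very large data.

    ```r
    # Example data, copied from the snippet above.
    pdata <- data.frame(parents_name = c("peter pan + marta steward",
                                         "pieter pan + marta steward",
                                         "armin dolgner + jane johanna dough",
                                         "jack jackson + sombody else"),
                        stringsAsFactors = FALSE)

    # Pairwise Levenshtein distances between all rows (base R, utils::adist).
    d <- as.dist(adist(pdata$parents_name))

    # Single linkage merges two groups as soon as any pair of their members is
    # within the cut height, so cutting at 2 links names at most 2 edits apart.
    hc <- hclust(d, method = "single")

    # sibling_group is a hypothetical column name holding the group label.
    pdata$sibling_group <- cutree(hc, h = 2)

    print(pdata)
    ```

    Note that single linkage chains matches: if A is within 2 edits of B and B is within 2 edits of C, all three land in one group even when A and C are further apart. Whether that is acceptable depends on how noisy the names are.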
