Setting: I have data on people, and their parent\'s names, and I want to find siblings (people with identical parent names).
pdata<-dat
I faced the same performance issue couple years ago. I had to match people's duplicates based on their typed names. My dataset had 200k names and the matrix approach exploded. After searching for some day about a better method, the method I'm proposing here did the job for me in some minutes:
library(stringdist)
parents_name <- c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else")
person_id <- 1:length(parents_name)
family_id <- vector("integer", length(parents_name))
#Looping through unassigned family ids
while(sum(family_id == 0) > 0){
ids <- person_id[family_id == 0]
dists <- stringdist(parents_name[family_id == 0][1],
parents_name[family_id == 0],
method = "lv")
matches <- ids[dists <= 3]
family_id[matches] <- max(family_id) + 1
}
result <- data.frame(person_id, parents_name, family_id)
That way the while will compare fewer matches on every iteration. From that, you might implement different performance boosters, like filtering the names with the same first letter before comparing, etc.