Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names).
pdata<-dat
If I understand correctly, you want to compare every parent pair (every row of the parents_name column) with all other pairs (rows), and keep the rows whose Levenshtein distance is less than or equal to 2.
I have written the following code as a starting point:
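library(stringdist)

pdata <- data.frame(parents_name = c("peter pan + marta steward",
                                     "pieter pan + marta steward",
                                     "armin dolgner + jane johanna dough",
                                     "jack jackson + sombody else"),
                    stringsAsFactors = FALSE)

fuzzy_match <- list()
system.time(for (i in seq_len(nrow(pdata))) {
  # Compare row i against every row; method = "lv" is plain Levenshtein
  # (stringdist defaults to "osa", which also counts transpositions).
  fuzzy_match[[i]] <- cbind(pdata, parents_name_2 = pdata[i, "parents_name"],
                            dist = as.integer(stringdist(pdata[i, "parents_name"],
                                                         pdata$parents_name,
                                                         method = "lv")))
  # Keep only close matches (this includes each row matched with itself, dist 0).
  fuzzy_match[[i]] <- fuzzy_match[[i]][fuzzy_match[[i]]$dist <= 2, ]
})
# Combine the per-row results into one data frame.
fuzzy_final <- do.call(rbind, fuzzy_match)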
Does it return what you wanted?
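If the self-matches and mirrored pairs from the loop get in the way, the same comparison can be sketched without a loop by computing the whole distance matrix at once. This is only a rough alternative, not a drop-in replacement: it swaps in base R's adist() (from utils, generalized Levenshtein distance) so that it runs without extra packages, and the name sibling_pairs is made up for illustration. stringdistmatrix() from the stringdist package should be able to play the same role as adist() here.

```r
# Example data, mirroring the data frame used above.
pdata <- data.frame(
  parents_name = c("peter pan + marta steward",
                   "pieter pan + marta steward",
                   "armin dolgner + jane johanna dough",
                   "jack jackson + sombody else"),
  stringsAsFactors = FALSE
)

# Full pairwise Levenshtein distance matrix (base R, utils::adist).
d <- adist(pdata$parents_name)

# Keep each unordered pair once: the upper triangle excludes the
# diagonal (self-matches) and the mirrored duplicates.
idx <- which(d <= 2 & upper.tri(d), arr.ind = TRUE)

# sibling_pairs is a hypothetical name for the result.
sibling_pairs <- data.frame(
  parents_name   = pdata$parents_name[idx[, "row"]],
  parents_name_2 = pdata$parents_name[idx[, "col"]],
  dist           = d[idx],
  stringsAsFactors = FALSE
)
```

Note that the full matrix needs memory proportional to the square of the number of rows, so for very large data the row-by-row loop (or some blocking on a cheaper key first) may still be the more practical choice.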