Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names).
pdata<-dat
My suggestion is to use a data science approach: cluster the names first, then use stringdist to compare only names that are similar (in the same cluster).
I have slightly modified the code that generates "parents_name" to add more variability to the first and second names, which makes the scenario closer to reality.
num<-4e6
#Random length
random_l<-round(runif(num,min = 5, max=15),0)
#Random strings in the first and second name
parent_rand_first<-stringi::stri_rand_strings(num, random_l)
order<-sample(1:num, num, replace=F)
parent_rand_second<-parent_rand_first[order]
#Paste first and second name
parents_name<-paste(parent_rand_first," + ",parent_rand_second,sep="")
parents_name[1:10]
Here starts the real analysis. First, extract features from the names, such as the total length, the length of the first name, the length of the second name, and the number of vowels and consonants in both the first and second name (plus any other feature of interest).
After that, bind all these features together and cluster the resulting matrix into a large number of clusters (e.g. 1000).
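The feature vectors themselves are not defined in the snippet, so here is a minimal base R sketch of how they could be computed. The helpers count_matches and name_features are hypothetical names, and the sketch assumes the two names are separated by a "+" sign and contain only unaccented Latin letters.

```r
# Hypothetical helper: number of characters of x that fall in the regex class `chars`.
count_matches <- function(x, chars) {
  nchar(gsub(paste0("[^", chars, "]"), "", tolower(x), perl = TRUE))
}

# Hypothetical helper: one row of features per "first + second" string.
name_features <- function(x) {
  x      <- as.character(x)
  parts  <- strsplit(x, "\\s*\\+\\s*")   # split on "+" with optional surrounding spaces
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  vowels     <- "aeiou"
  consonants <- "b-df-hj-np-tv-z"        # assumption: everything else in a-z is a consonant
  cbind(nchars             = nchar(x),
        nchars_first       = nchar(first),
        nchars_second      = nchar(second),
        nvowels_first      = count_matches(first, vowels),
        nvowels_second     = count_matches(second, vowels),
        nconsonants_first  = count_matches(first, consonants),
        nconsonants_second = count_matches(second, consonants))
}

toy_features <- name_features(c("peter pan + marta steward",
                                "jack jackson + sombody else"))
toy_features
```

On the real data, name_features(parents_name) would give a matrix with the same seven columns as the cbind call.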
features<-cbind(nchars,nchars_first,nchars_second,nvowels_first,nvowels_second,nconsonants_first,nconsonants_second)
n_clusters<-1000
clusters<-kmeans(features,centers = n_clusters)
Apply stringdistmatrix only inside each cluster (each cluster contains similar pairs of names).
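library(stringdist)
dist_matrix<-vector("list",n_clusters)
for(i in 1:n_clusters)
{
cluster_i<-clusters$cluster==i
#Use a separate variable so that parents_name is not overwritten between iterations
names_i<-as.character(parents_name[cluster_i])
dist_matrix[[i]]<-stringdistmatrix(names_i,names_i,method="lv")
}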
dist_matrix holds the distance between each pair of elements in a cluster, and you can assign the family_id using this distance.
In this example, computing the distances for one cluster takes approximately 1 second (depending on the size of the cluster), so all the distances are computed in about 15 minutes.
WARNING: dist_matrix grows very fast. In your code it is better to analyze each matrix inside the for loop, extract family_id there, and then discard the matrix.
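As a rough sketch of that "analyze inside the loop, then discard" idea, the snippet below assigns family ids cluster by cluster without keeping any distance matrix. Several things here are my assumptions and not part of the code above: it swaps in base R's adist (generalized Levenshtein) so it runs without extra packages, it turns distances into groups with single-linkage hclust cut at distance 2, and the helper assign_family_within and the toy cluster labels are hypothetical.

```r
# Hypothetical helper: label strings of one cluster, joining those within max_dist edits.
assign_family_within <- function(names_i, max_dist = 2) {
  if (length(names_i) < 2) return(seq_along(names_i))
  d <- as.dist(adist(names_i))              # Levenshtein distances, local to this cluster
  cutree(hclust(d, method = "single"), h = max_dist)
}

toy_names <- c("peter pan + marta steward",
               "pieter pan + marta steward",
               "armin dolgner + jane johanna dough",
               "jack jackson + sombody else")
toy_cluster <- c(1L, 1L, 2L, 2L)            # stand-in for clusters$cluster

family_id <- integer(length(toy_names))
offset <- 0L
for (i in sort(unique(toy_cluster))) {
  idx <- which(toy_cluster == i)
  local_id <- assign_family_within(toy_names[idx])
  family_id[idx] <- local_id + offset       # keep ids unique across clusters
  offset <- offset + max(local_id)
  # the distance matrix only lived inside assign_family_within, so nothing accumulates
}
family_id
```

Note that single linkage chains matches together (if A is close to B and B is close to C, all three share an id), which may or may not be what you want.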
If I get it right, you want to compare every parent pair (every row of the parents_name column) with all other pairs (rows), and keep the rows whose Levenshtein distance is smaller than or equal to 2.
I have written following code for the beginning:
pdata<-data.frame(parents_name=c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else"))
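library(stringdist)
fuzzy_match <- list()
system.time(for (i in 1:nrow(pdata)){
# method = "lv" selects Levenshtein distance (the stringdist default is "osa")
fuzzy_match[[i]] <- cbind(pdata, parents_name_2 = pdata[i,"parents_name"],
dist = as.integer(stringdist(pdata[i,"parents_name"], pdata$parents_name, method = "lv")))
# Keep only the rows within distance 2 of row i
fuzzy_match[[i]] <- fuzzy_match[[i]][fuzzy_match[[i]]$dist <= 2,]
})
# Stack the per-row matches into one data frame
fuzzy_final <- do.call(rbind, fuzzy_match)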
Does it return what you wanted?
What I have used to reduce the permutations involved in this sort of name matching is a function that counts the syllables in the name (surname) involved. The count is then stored in the database as a pre-processed value. This becomes a Syllable Hash function.
Then you can choose to group words together with the same number of syllables as each other. (Although I use algorithms that allow 1 or 2 syllables difference, which may be presented as legitimate spelling / typo errors...But my research has found that 95% of misspellings share the same number of syllables)
For example, Peter and Pieter would have the same syllable count (2), but Jones and Smith do not (they have 1).
If your function does not get 1 syllable for Jones, then you may need to increase your tolerance to allow for at least 1 syllable difference in the Syllable Hash function grouping that you use. (To account for incorrect syllable function results, and to catch the matching surname correctly in the grouping)
My syllable counting function may not apply completely, as you might need to cope with non-English letter sets. (So I have not pasted the code. It's in C anyway.) Mind you, the syllable count function does not have to be accurate in terms of TRUE syllable count; it simply needs to act as a reliable hashing function, which it does. It is far superior to SoundEx, which relies on the first letter being accurate.
Give it a go, you might be surprised how much improvement you get by implementing a Syllable Hash function. You may have to ask SO for help getting the function into your language.
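To give an idea of the shape of such a hash in R, here is a deliberately naive sketch. It is not my C function: syllable_hash is a hypothetical stand-in that simply counts runs of consecutive vowels (treating "y" as a vowel, which is an assumption).

```r
# Hypothetical stand-in for a syllable counter: counts runs of consecutive vowels.
syllable_hash <- function(x) {
  lx <- tolower(x)
  lengths(regmatches(lx, gregexpr("[aeiouy]+", lx)))
}

surnames <- c("Peter", "Pieter", "Jones", "Smith")
hash <- syllable_hash(surnames)

# Candidate groups: only names that share a hash value need a full string comparison.
groups <- split(surnames, hash)
groups
```

This naive rule counts the silent "e" and gives Jones a hash of 2, which is exactly the kind of inaccuracy that a tolerance of 1 syllable is meant to absorb.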
Making equivalence groups from a non-transitive relation does not make sense. If A is like B and B is like C, but A is not like C, how would you make families from that? Using something like soundex (that was the idea of Neal Fultz, not mine) seems the only meaningful option, and it solves your performance problem too.
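A tiny illustration of the problem, using base R's adist (generalized Levenshtein distance). The three names and the threshold of 1 are made up for the example.

```r
x <- c("jon", "john", "johan")
d <- adist(x)                  # pairwise edit distances
dimnames(d) <- list(x, x)

close_enough <- d <= 1         # "is like" relation with threshold 1
close_enough
```

Here "jon" is like "john" and "john" is like "johan", but "jon" is not like "johan" (distance 2), so there is no consistent way to split the three into families.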
This reproduces your output. I guess you will have to decide on the partial matching criteria; I kept the default agrep ones.
pdata$parents_name<-as.character(pdata$parents_name)
x00<-unique(lapply(pdata$parents_name,function(x) agrep(x,pdata$parents_name)))
x=c()
for (i in 1:length(x00)){
x=c(x,rep(i,length(x00[[i]])))
}
pdata$person_id=seq_len(nrow(pdata))
pdata$family_id=x
You are using the stringdist package anyway, so does stringdist::phonetic() suit your needs? It computes the soundex code for each string, e.g.:
phonetic(pdata$parents_name)
[1] "P361" "P361" "A655" "J225"
Soundex is a tried-and-true method (almost 100 years old) for hashing names, and that means you don't need to compare every single pair of observations.
You might want to go further and do soundex on the first name and last name separately for father and mother.
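Once every row has a code, grouping is a lookup, not a pairwise comparison. A minimal sketch, with the codes copied from the phonetic() output shown earlier so that it runs without the package; the family_id column name is taken from the question.

```r
# Codes copied from the phonetic() output for the four example rows.
codes <- c("P361", "P361", "A655", "J225")

# Rows with the same code get the same family_id; no pairwise distances needed.
family_id <- match(codes, unique(codes))
family_id
```

With per-name codes you could paste the separate codes together into one key and group on that key in the same way.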