Create a unique ID by fuzzy matching of names (via agrep using R)

左心房为你撑大大i submitted on 2019-11-30 09:23:39

Here's my shot at it. It's probably not very efficient, but I think it will get the job done. I assume that df$candidate is of class factor.

# Fuzzy-match each candidate name against the other candidate names.
# Each pair of names is compared only once, because a name is matched
# only against the names that have a greater index.
matches <- unlist(lapply(seq_len(nlevels(df[["candidate"]]) - 1),
    function(x) {max(x, x + agrep(
        pattern=levels(df[["candidate"]])[x],
        x=levels(df[["candidate"]])[-seq_len(x)]
    ))}
))
# Assign the new levels (the last level is omitted because it doesn't change).
levels(df[["candidate"]])[-length(levels(df[["candidate"]]))] <-
    levels(df[["candidate"]])[matches]
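
To see what this does, here is a small self-contained sketch that applies the same logic to a toy data frame and then derives an integer ID from the merged levels. The data and the `id` column are made up for illustration, and the ID step is an assumption about the end goal rather than part of the answer above.

```r
# Toy data; the names are invented for illustration.
df <- data.frame(candidate = factor(c("John Smith", "Jon Smith",
                                      "Mary Jones", "John Smith")))
lv <- levels(df$candidate)

# For each level but the last, pick the highest-indexed later level that
# agrep() reports as a fuzzy match, or the level itself if there is none.
matches <- vapply(seq_len(length(lv) - 1), function(i) {
  as.integer(max(i, i + agrep(lv[i], lv[-seq_len(i)])))
}, integer(1))

# Duplicated level names are merged by `levels<-`.
levels(df$candidate)[-length(lv)] <- lv[matches]

# Hypothetical final step: use the merged factor's integer codes as the ID.
df$id <- as.integer(df$candidate)
print(df)
```

One caveat worth knowing: a chain of matches is not fully collapsed in a single pass. If A matches only B and B matches only C, then A is relabelled with B's old name while B is relabelled with C's, so a second pass may be needed.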

OK, given that the focus is on efficiency, I'd suggest the following.

First, note that we can predict the order of speed from first principles: exact matching will be much faster than grep, which in turn will be faster than fuzzy grep (agrep). So do exact matching first, then run fuzzy grep only on the observations that remain unmatched.
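
A minimal sketch of that two-step idea, assuming there is a vector of canonical names to match against. The vectors `reference` and `raw` and their contents are hypothetical, and taking the first fuzzy hit is a simplification.

```r
# Hypothetical inputs: `reference` holds canonical names, `raw` the names to resolve.
reference <- c("John Smith", "Mary Jones", "Robert Brown")
raw <- c("Mary Jones", "Jon Smith", "Robert Brown", "Zed Quux")

# Step 1: exact matching, fully vectorized via match().
id <- match(raw, reference)

# Step 2: fuzzy matching only for the names that exact matching missed.
for (i in which(is.na(id))) {
  hits <- agrep(raw[i], reference)
  if (length(hits) > 0) id[i] <- hits[1]  # first hit wins; ties are not resolved
}

print(data.frame(raw, id))
```

Names that match nothing keep an `NA` ID here, so they can be reviewed by hand or appended to `reference` as new entries.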

Second, vectorize and avoid loops. The apply commands aren't necessarily faster than loops, so stick to native vectorization if you can. All the grep commands are natively vectorized over the strings being searched, but each call takes a single pattern, so it's going to be hard to avoid a *ply or loop to compare each element to the vector of others to match to.
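
One way to get all pairwise comparisons in a single call, which this answer does not mention, is `utils::adist()`. It returns a full edit-distance matrix. The sketch below is only an illustration: the threshold of 1 edit is an arbitrary assumption, and `adist()` compares whole strings by default, so its results will not always agree with `agrep()`.

```r
# Invented example names.
names_vec <- c("John Smith", "Jon Smith", "Mary Jones")

# Pairwise generalized Levenshtein distances, computed in one call.
d <- utils::adist(names_vec)

# Keep each pair once (upper triangle) where the distance is at most 1 edit.
pairs <- which(d <= 1 & upper.tri(d), arr.ind = TRUE)
print(pairs)
```

The matrix grows with the square of the number of names, so this only suits moderately sized inputs or inputs that have already been narrowed down.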

Third, use outside information to narrow the problem down. For instance, do fuzzy matching on names only within each city or state, which will dramatically reduce the number of comparisons that must be made.

You can combine the first and third principles: You might even try exact matching on the first character of each string, then fuzzy matching within that.
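
As a rough sketch of this blocking idea, the code below groups names by their first character and runs the fuzzy step only inside each group. The helper `fuzzy_canonical()` and the sample names are hypothetical. A city or state column could be passed as the grouping variable instead of `block`.

```r
# Hypothetical helper: within one block, map each name to the first
# earlier name that agrep() reports as a fuzzy match.
fuzzy_canonical <- function(x) {
  u <- unique(x)
  canon <- u
  for (i in seq_along(u)[-1]) {
    hits <- agrep(u[i], canon[seq_len(i - 1)])
    if (length(hits) > 0) canon[i] <- canon[hits[1]]
  }
  canon[match(x, u)]
}

# Invented example names.
candidates <- c("John Smith", "Jon Smith", "Mary Jones", "Mary Jonas", "Jane Doe")

# Exact match on the first character defines the blocks.
block <- substr(candidates, 1, 1)

# Fuzzy matching runs separately within each block.
canonical <- ave(candidates, block, FUN = fuzzy_canonical)

# Integer ID per canonical name.
id <- match(canonical, unique(canonical))
print(data.frame(candidates, canonical, id))
```

The trade-off is that two spellings that differ in their first character (or are recorded under different states) will never be compared, so the blocking key should be something you trust.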
