Fuzzy merging in R - seeking help to improve my code

Posted by 冷暖自知 on 2021-01-20 19:51:36

Question


Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself that combines exact and fuzzy (string-distance based) matching. The merging job I have to do is quite big (it results in multiple string distance matrices with a little under one billion cells), and I had the impression that fuzzy_join is not written very efficiently with regard to memory usage, and that its parallelization is implemented in an odd way: when there are multiple fuzzy variables, the computation of the string distance matrices is parallelized, not the computation of the string distances themselves.

As in fuzzy_join, the idea is to match on the exact variables where possible (to keep the matrices smaller) and then to proceed to fuzzy matching within these exactly matched groups. I think the function is largely self-explanatory. I am posting it here because I would like feedback to improve it, and because I suspect I am not the only one trying to do this kind of thing in R. I admit that Python, SQL and similar tools would probably be more efficient in this context, but one has to stick with what one feels most comfortable with, and doing the data cleaning and preparation in the same language is nice for reproducibility.
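To illustrate the core idea in miniature before the full function: within each block defined by the exact variables, a string distance matrix is computed and every row is assigned its nearest column. This toy snippet uses made-up strings and the Jaro-Winkler distance purely as an example metric:

```r
library(stringdist)

# Toy illustration: compute a distance matrix between two name vectors
# and pick, for each entry of 'a', the closest entry of 'b'.
a <- c("acme corp", "widget ltd")
b <- c("acme corporation", "widgets limited", "unrelated inc")

d    <- stringdistmatrix(a, b, method = "jw")     # 2 x 3 distance matrix
best <- max.col(-d, ties.method = "first")        # column index of the minimum per row

data.frame(a = a, match = b[best],
           dist = d[cbind(seq_along(a), best)])
```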

merge.fuzzy = function(a,b,.exact,.fuzzy,.weights,.method,.ncores) {
    require(data.table)
    require(stringdist)
    require(matrixStats)
    require(parallel)

    if (length(.fuzzy)!=length(.weights)) {
        stop("'.fuzzy' and '.weights' must have the same length")
    }

    if (!is.data.table(a)) {
        stop("'a' must be of class data.table")
    }

    if (!is.data.table(b)) {
        stop("'b' must be of class data.table")
    }

    #convert everything to lower
    a[,c(.fuzzy):=lapply(.SD,tolower),.SDcols=.fuzzy]
    b[,c(.fuzzy):=lapply(.SD,tolower),.SDcols=.fuzzy]

    a[,c(.exact):=lapply(.SD,tolower),.SDcols=.exact]
    b[,c(.exact):=lapply(.SD,tolower),.SDcols=.exact]

    #create ids
    a[,"id.a":=as.numeric(.I),by=c(.exact,.fuzzy)]
    b[,"id.b":=as.numeric(.I),by=c(.exact,.fuzzy)]


    c <- unique(rbind(a[,.exact,with=FALSE],b[,.exact,with=FALSE]))
    c[,"exa.id":=.GRP,by=.exact]

    a <- merge(a,c,by=.exact,all=FALSE)
    b <- merge(b,c,by=.exact,all=FALSE)

    ##############

    stringdi <- function(a,b,.weights,.by,.method,.ncores) {
        sdm      <- list()

        if (is.null(.weights)) {.weights <- rep(1,length(.by))}

        if (nrow(a) < nrow(b)) {
            for (i in 1:length(.by)) {
                sdm[[i]] <- stringdistmatrix(a[[.by[i]]],b[[.by[i]]],method=.method,ncores=.ncores,useNames=TRUE)
            }
        } else {
            for (i in 1:length(.by)) { #if b is shorter, swap the arguments (this enhances parallelization speed) and transpose back so that rows still correspond to 'a'
                sdm[[i]] <- t(stringdistmatrix(b[[.by[i]]],a[[.by[i]]],method=.method,ncores=.ncores,useNames=FALSE))
            }
        }

        rsdm = nrow(sdm[[1]])
        csdm = ncol(sdm[[1]])
        sdm  = matrix(unlist(sdm),ncol=length(.by))
        #weighted mean of the per-variable distances; sweep() applies the weights column-wise
        sdm  = rowSums(sweep(sdm,2,.weights,"*"),na.rm=TRUE)/((0 + !is.na(sdm)) %*% .weights)
        sdm  = matrix(sdm,nrow=rsdm,ncol=csdm)

        #use ids as row/ column names
        rownames(sdm) <- a$id.a
        colnames(sdm) <- b$id.b

        mid           <- max.col(-sdm,ties.method="first")
        mid           <- matrix(c(1:nrow(sdm),mid),ncol=2)
        bestdis       <- sdm[mid] 

        res           <- data.table(as.numeric(rownames(sdm)),as.numeric(colnames(sdm)[mid[,2]]),bestdis)
        setnames(res,c("id.a","id.b","dist"))

        res
    }

    setkey(b,exa.id)
    distances = a[,stringdi(.SD,b[J(.BY[[1]])],.weights=.weights,.by=.fuzzy,.method=.method,.ncores=.ncores),by=exa.id]

    a    = merge(a,distances,by=c("exa.id","id.a"))
    res  = merge(a,b,by=c("exa.id","id.b"))


    res
}
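Since I cannot share my data, here is a minimal self-invented usage example (the data.tables, the column names city and name, and the choice of method = "jw" are all made up for illustration; note that merge.fuzzy modifies a and b by reference):

```r
library(data.table)

# Hypothetical toy data: match firms across two sources, exactly on 'city'
# and fuzzily on 'name'. All values are invented.
a <- data.table(city = c("Berlin", "Paris"),
                name = c("Acme Corp", "Foo Sarl"))
b <- data.table(city = c("Berlin", "Berlin", "Paris", "Paris"),
                name = c("Acme Corporation", "Unrelated GmbH",
                         "Foo S.A.R.L.", "Bar SAS"))

res <- merge.fuzzy(a, b, .exact = "city", .fuzzy = "name",
                   .weights = 1, .method = "jw", .ncores = 1)
# 'res' contains one row per row of 'a' with the best match from 'b'
# within the same city, plus the 'dist' column with the best distance.
```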

The following points would be interesting:

  1. I am not quite sure how to code multiple exact matching variables in the data.table style I used above (which I believe is the fastest option).
  2. Is it possible to have nested parallelization? That is, can I use a parallel foreach loop on top of the computation of the string distance matrices?
  3. I am also interested in ideas for making the whole thing more efficient, i.e. consuming less memory.
  4. Maybe you can suggest a bigger "real world" data set so that I can create a working example. Unfortunately I cannot share even small samples of my data with you.
  5. In the future it would also be nice to do something other than a classic inner join, so ideas on this topic are also very much appreciated.
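Regarding point 2, a hedged sketch of one alternative: rather than nesting a parallel loop on top of stringdistmatrix's own ncores argument (which tends to oversubscribe the CPU), one can parallelize over the exact-match blocks and keep the inner distance computation serial. The block contents below are made-up toy data:

```r
library(parallel)
library(stringdist)

# Toy blocks standing in for the exactly matched groups.
blocks_a <- list(c("acme corp", "widget ltd"), c("foo sarl"))
blocks_b <- list(c("acme corporation", "widgets limited"), c("foo s.a.r.l."))

# Outer parallelism over blocks; inner call stays serial (ncores = 1).
# mclapply forks, which is not supported on Windows, so fall back to 1 core there.
n_cores <- if (.Platform$OS.type == "windows") 1 else 2
dists <- mclapply(seq_along(blocks_a), function(i) {
  stringdistmatrix(blocks_a[[i]], blocks_b[[i]],
                   method = "jw", ncores = 1)
}, mc.cores = n_cores)
```

One distance matrix per block comes back as a list, which fits the by = exa.id grouping used in the function above.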

All your comments are welcome!

Source: https://stackoverflow.com/questions/29449480/fuzzy-merging-in-r-seeking-help-to-improve-my-code
