R: How to do a soft (fuzzy) match on a string

天命终不由人 2020-12-17 06:16

I need to soft-match (fuzzy-match) a given input string against one column of a data frame, for example:

col <- c("John Collingson","J Collingson","Dummy Name1","Dummy


        
2 answers
  •  轮回少年
    2020-12-17 07:00

    agrep is a quick and easy base R solution if you have only a little data. If this is a toy example of a larger data frame, you may be interested in a more durable tool. In the past month, learning about the Levenshtein distance noted by @PaulHiemstra (also in these different questions) led me to the RecordLinkage package. The vignettes leave me wanting more examples of "soft" or "fuzzy" matches, particularly across more than one field, but the basic answer to your question could be something like:

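    library(RecordLinkage)
    # Candidate names to search, and the input string to match against them
    col <- data.frame(names1 = c("John Collingson","J Collingson","Dummy Name1","Dummy Name2"))
    inputText <- data.frame(names2 = c("J Collingson"))
    # Compare every input/candidate pair using a string comparator
    g1 <- compare.linkage(inputText, col, strcmp = TRUE)
    # Compute a match weight for each pair
    g2 <- epiWeights(g1)
    # Show the pairs whose weight is at least 0.6
    getPairs(g2, min.weight=0.6)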
    # id          names2 Weight
    # 1  1    J Collingson       
    # 2  2    J Collingson  1.000
    # 3                          
    # 4  1    J Collingson       
    # 5  1 John Collingson  0.815
    
    # Same comparison with a misspelled input string
    inputText2 <- data.frame(names2 = c("Jon Collinson"))
    g1 <- compare.linkage(inputText2, col, strcmp = TRUE)
    g2 <- epiWeights(g1)
    getPairs(g2, min.weight=0.6)
    # id          names2    Weight
    # 1  1   Jon Collinson          
    # 2  1 John Collingson 0.9644444
    # 3                             
    # 4  1   Jon Collinson          
    # 5  2    J Collingson 0.7924825
    

    Start with compare.linkage() or compare.dedup(); for large data sets, use RLBigDataLinkage() or RLBigDataDedup(). Hope this helps.
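
The base R route mentioned at the start of the answer above can be sketched as follows. This is only an illustration, not the answerer's code: the names come from the question, while the `max.distance = 2L` threshold and the variable names `input`, `hits`, `d`, and `ranked` are assumptions made for the example. Note that `agrep()` matches approximately against substrings, so a short input can match a longer name that roughly contains it. `adist()` gives the generalized Levenshtein distance for each whole string, which can be used to rank candidates.

```r
# Candidate names and the input string, taken from the question.
col <- c("John Collingson", "J Collingson", "Dummy Name1", "Dummy Name2")
input <- "J Collingson"

# agrep() does approximate (fuzzy) matching. max.distance = 2L allows up to
# two edits; this threshold is an illustrative assumption, tune it per data.
hits <- agrep(input, col, max.distance = 2L, ignore.case = TRUE, value = TRUE)
print(hits)

# adist() returns generalized Levenshtein distances between whole strings.
# drop() turns the 1 x 4 result matrix into a plain vector.
d <- drop(adist(input, col, ignore.case = TRUE))

# Rank the candidates from closest to farthest.
ranked <- data.frame(name = col, distance = d)[order(d), ]
print(ranked)
```

For a handful of names this needs no extra package; for larger data or several fields, the RecordLinkage approach above is likely the better fit.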
