Merging through fuzzy matching of variables in R

前端 未结 2 1302
予麋鹿
予麋鹿 2020-12-14 10:35

I have two dataframes (x & y) where the IDs are student_name, father_name and mother_name. Because of typographical errors (\"n\"

相关标签:
2条回答
  • 2020-12-14 11:08

    The agrep function (part of base R), which does approximate string matching using the Levenshtein edit distance is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution. But this is a suggestion... It records matches in a separate list (if there are multiple equally good matches, then these are recorded as well). Let's say that your data.frame is called df:

    l <- vector('list',nrow(df))
    matches <- list(mother = l,father = l)
    for(i in 1:nrow(df)){
      father_id <- with(df,which(student_name[i] == father_name))
      if(length(father_id) == 1){
        matches[['father']][[i]] <- father_id
      } else {
        old_father_id <- NULL
        ## try to find the total                                                                                                                                 
        for(m in 10:1){ ## m is the maximum distance                                                                                                             
          father_id <- with(df,agrep(student_name[i],father_name,max.dist = m))
          if(length(father_id) == 1 || m == 1){
            ## if we find a unique match or if we are in our last round, then stop                                                                               
            matches[['father']][[i]] <- father_id
            break
          } else if(length(father_id) == 0 && length(old_father_id) > 0) {
            ## if we can't do better than multiple matches, then record them anyway                                                                              
            matches[['father']][[i]] <- old_father_id
            break
          } else if(length(father_id) == 0 && length(old_father_id) == 0) {
            ## if the nearest match is more than 10 different from the current pattern, then stop                                                                
            break
          }
        }
      }
    }
    

    The code for the mother_name would be basically the same. You could even put them together in a loop, but this example is just for the purpose of illustration.

    0 讨论(0)
  • 2020-12-14 11:20

    This takes a list of common column names, matches based on agrep of all those columns combined, and then if all.x or all.y equals TRUE it appends non-matching records filling in missing columns with NA. Unlike merge, the column names to match on need to be the same in each data frame. The challenge would seem to be setting the agrep options correctly to avoid spurious matches.

      agrepMerge <- function(df1, df2, by, all.x = FALSE, all.y = FALSE, 
        ignore.case = FALSE, value = FALSE, max.distance = 0.1, useBytes = FALSE) {
    
        df1$index <- apply(df1[,by, drop = FALSE], 1, paste, sep = "", collapse = "")
        df2$index <- apply(df2[,by, drop = FALSE], 1, paste, sep = "", collapse = "")
    
        matches <- lapply(seq_along(df1$index), function(i, ...) {
          agrep(df1$index[i], df2$index, ignore.case = ignore.case, value = value,
                max.distance = max.distance, useBytes = useBytes)
        })
    
        df1_match <- rep(1:nrow(df1), sapply(matches, length))
        df2_match <- unlist(matches)
    
        df1_hits <- df1[df1_match,]
        df2_hits <- df2[df2_match,]
    
        df1_miss <- df1[setdiff(seq_along(df1$index), df1_match),]
        df2_miss <- df2[setdiff(seq_along(df2$index), df2_match),]
    
        remove_cols <- colnames(df2_hits) %in% colnames(df1_hits)
    
        df_out <- cbind(df1_hits, df2_hits[,!remove_cols])
    
        if(all.x) {
          missing_cols <- setdiff(colnames(df_out), colnames(df1_miss))
          df1_miss[missing_cols] <- NA
          df_out <- rbind(df_out, df1_miss)
        }
        if(all.x) {
          missing_cols <- setdiff(colnames(df_out), colnames(df2_miss))
          df2_miss[missing_cols] <- NA
          df_out <- rbind(df_out, df2_miss)
        }
        df_out[,setdiff(colnames(df_out), "index")]
    }
    
    0 讨论(0)
提交回复
热议问题