Merging through fuzzy matching of variables in R

予麋鹿 2020-12-14 10:35

I have two dataframes (x & y) where the IDs are student_name, father_name and mother_name. Because of typographical errors ("n"

2 answers
  •  无人及你
    2020-12-14 11:08

    The agrep function (part of base R), which does approximate string matching using the generalized Levenshtein edit distance, is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution, so treat this as a sketch. It records matches in a separate list (if there are multiple equally good matches, these are recorded as well). Let's say that your data.frame is called df:

    l <- vector('list', nrow(df))
    matches <- list(mother = l, father = l)
    for (i in seq_len(nrow(df))) {
      ## first look for a single exact match
      father_id <- with(df, which(student_name[i] == father_name))
      if (length(father_id) == 1) {
        matches[['father']][[i]] <- father_id
      } else {
        old_father_id <- integer(0)
        ## otherwise tighten the allowed edit distance step by step
        for (m in 10:1) { ## m is the maximum distance
          father_id <- with(df, agrep(student_name[i], father_name, max.distance = m))
          if (length(father_id) == 1) {
            ## a unique match: record it and stop
            matches[['father']][[i]] <- father_id
            break
          } else if (length(father_id) == 0) {
            ## nothing at this distance: fall back to the matches from the
            ## previous (looser) round, if any; if even a distance of 10
            ## finds nothing, leave the entry empty
            if (length(old_father_id) > 0) matches[['father']][[i]] <- old_father_id
            break
          } else if (m == 1) {
            ## last round and still several matches: record them all
            matches[['father']][[i]] <- father_id
            break
          }
          ## remember this round's matches before tightening further
          old_father_id <- father_id
        }
      }
    }
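
    To get a feel for what the maximum distance does in the loop above, here is a small sketch with made-up names (the vector father_name and its contents are assumptions for illustration, not your data). As far as I know, agrep looks for the pattern approximately within each element, so a large distance tends to match almost everything and a small one only near-identical strings:

```r
# Made-up candidate names, purely for illustration
father_name <- c("John Smith", "Jon Smith", "Mary Jones")

# Loose search: up to 10 edits allowed, so expect many (possibly all) hits
loose <- agrep("John Smith", father_name, max.distance = 10)

# Tight search: at most 1 edit allowed, so only the exact name and the
# one-letter typo should remain
tight <- agrep("John Smith", father_name, max.distance = 1)

print(loose)
print(tight)
```

    This is why the loop starts at a generous distance and narrows down until the match is unique.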
    

    The code for the mother_name would be basically the same. You could even put them together in a loop, but this example is just for the purpose of illustration.
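
    As a rough sketch of putting both columns in one loop: the helper find_matches below and the example data are hypothetical names and values I made up, and the helper just repeats the narrowing search from the code above for an arbitrary column.

```r
# Hypothetical helper: same narrowing search as above, for any column.
# Returns the indices of the best candidates (possibly none or several).
find_matches <- function(pattern, candidates, max_start = 10L) {
  exact <- which(candidates == pattern)
  if (length(exact) == 1L) return(exact)
  previous <- integer(0)
  for (m in seq(max_start, 1L)) {
    hits <- agrep(pattern, candidates, max.distance = m)
    if (length(hits) == 1L) return(hits)      # unique match: done
    if (length(hits) == 0L) return(previous)  # fall back to looser round
    previous <- hits                          # remember, then tighten
  }
  previous  # still ambiguous at distance 1: return all remaining hits
}

# Made-up example data with typos ("Jon" for "John", "Ana" for "Anna")
df <- data.frame(
  student_name = c("John Smith", "Lucy Brown", "Anna Jones"),
  father_name  = c("Robert Smith", "Jon Smith", "Mark Jones"),
  mother_name  = c("Mary Smith", "Ana Jones", "Lucy Jones"),
  stringsAsFactors = FALSE
)

# One pass per parent column; the result is a list with elements
# 'father' and 'mother', each holding one entry per student
cols <- c(father = "father_name", mother = "mother_name")
matches <- lapply(cols, function(col) {
  lapply(df$student_name, find_matches, candidates = df[[col]])
})

print(matches$father[[1]])
print(matches$mother[[3]])
```

    Note that a unique hit found at a large distance can still be a false positive, so matches that only became unique after several rounds are worth checking by hand.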
