R merge data frames, allow inexact ID matching (e.g. with additional characters, 1234 matches ab1234)

情深已故 2020-12-10 16:10

I am trying to deal with some very messy data. I need to merge two large data frames which contain different kinds of data by the sample ID. The problem is that one table's

3 Answers
  •  误落风尘
    2020-12-10 16:41

    Merging on a condition is a little tricky. I don't think you can do it with merge as it is written, so you end up having to write a custom function and apply it with by. It is pretty inefficient, but then, so is merge. If you have millions of rows, consider data.table. This is how you would do an "inner join", where only rows that match are returned.

    # I slightly modified your data to test multiple matches    
    a<-data.frame(aID=c("1234","1234","4567","6789","3645"),aInfo=c("blue","blue2","green","goldenrod","cerulean"))
    b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))
    
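    # For one group of rows in a that share an aID, pick the rows of b whose bID
    # approximately contains that ID (agrep), then pair them up. With no shared
    # column names, merge returns every combination of the two sets of rows.
    f<-function(x) merge(x,b[agrep(x$aID[1],b$bID),],all=TRUE)
    # Apply f to each distinct aID and stack the pieces
    do.call(rbind,by(a,a$aID,f))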
    
    #         aID     aInfo    bID       bInfo
    # 1234.1 1234      blue (1234)      banana
    # 1234.2 1234     blue2 (1234)      banana
    # 3645   3645  cerulean  23645 pomegranate
    # 4567   4567     green   4567       apple
    # 6789   6789 goldenrod   6789        kiwi
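
    One caveat: agrep does approximate matching, so with its default max.distance it can accept a short ID that differs by one character, not only an ID with extra characters around it. If the IDs in one table only ever gain characters, a literal substring test is stricter. A minimal sketch under that assumption (contains_join is a hypothetical helper name, not part of base R; the data are the samples from above, with the IDs kept as character):

    ```r
    a <- data.frame(aID = c("1234", "1234", "4567", "6789", "3645"),
                    aInfo = c("blue", "blue2", "green", "goldenrod", "cerulean"),
                    stringsAsFactors = FALSE)
    b <- data.frame(bID = c("4567", "(1234)", "6789", "23645", "63528973"),
                    bInfo = c("apple", "banana", "kiwi", "pomegranate", "lychee"),
                    stringsAsFactors = FALSE)

    # Hypothetical helper: inner join that pairs a row of a with a row of b
    # whenever the aID occurs verbatim somewhere inside the bID.
    contains_join <- function(a, b) {
      # hit[i, j] is TRUE when a$aID[i] is a literal substring of b$bID[j]
      hit <- outer(a$aID, b$bID,
                   Vectorize(function(p, x) grepl(p, x, fixed = TRUE)))
      idx <- which(hit, arr.ind = TRUE)            # (row, col) pairs that match
      idx <- idx[order(idx[, "row"]), , drop = FALSE]
      out <- cbind(a[idx[, "row"], , drop = FALSE],
                   b[idx[, "col"], , drop = FALSE])
      rownames(out) <- NULL
      out
    }

    res <- contains_join(a, b)
    print(res)
    ```

    This compares every aID with every bID, so it is no faster than the by version on large tables; it only tightens what counts as a match.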
    

    A full join is a little trickier. Here is one way, which is still inefficient:

    f<-function(x,b) {
      # Rows of b whose ID (first column) approximately contains the ID of x
      matches<-b[agrep(x[1,1],b[,1]),]
      if (nrow(matches)>0) merge(x,matches,all=TRUE)
      # Ugly... but how else to create a data.frame full of NAs?
      else merge(x,b[NA,][1,],all.x=TRUE)
    }
    d<-do.call(rbind,by(a,a$aID,f,b))
    # Rows of b that nothing in a matched
    left.over<-!(b$bID %in% d$bID)
    # Group the left-over rows by their own bID and pad a's columns with NAs
    rbind(d,do.call(rbind,by(b[left.over,],b$bID[left.over],f,a))[names(d)])
    
    #           aID     aInfo      bID       bInfo
    # 1234.1   1234      blue   (1234)      banana
    # 1234.2   1234     blue2   (1234)      banana
    # 3645     3645  cerulean    23645 pomegranate
    # 4567     4567     green     4567       apple
    # 6789     6789 goldenrod     6789        kiwi
    # 63528973 <NA>      <NA> 63528973      lychee
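
    On the question in the code comment: indexing a data frame with NA_integer_ returns a single row of NAs, which gives another way to pad the unmatched side. The sketch below builds a full join from a match matrix instead of by. It assumes character IDs and swaps agrep for a literal substring test (grepl with fixed = TRUE); contains_full_join is a hypothetical name.

    ```r
    a <- data.frame(aID = c("1234", "1234", "4567", "6789", "3645"),
                    aInfo = c("blue", "blue2", "green", "goldenrod", "cerulean"),
                    stringsAsFactors = FALSE)
    b <- data.frame(bID = c("4567", "(1234)", "6789", "23645", "63528973"),
                    bInfo = c("apple", "banana", "kiwi", "pomegranate", "lychee"),
                    stringsAsFactors = FALSE)

    # Hypothetical helper: full join on "aID occurs literally inside bID".
    contains_full_join <- function(a, b) {
      # hit[i, j] is TRUE when a$aID[i] is a literal substring of b$bID[j]
      hit <- outer(a$aID, b$bID,
                   Vectorize(function(p, x) grepl(p, x, fixed = TRUE)))
      idx <- which(hit, arr.ind = TRUE)
      a_only <- which(rowSums(hit) == 0)   # rows of a with no partner in b
      b_only <- which(colSums(hit) == 0)   # rows of b with no partner in a
      # An NA row index yields a row of NAs, which pads the unmatched side
      rows <- c(idx[, "row"], a_only, rep(NA_integer_, length(b_only)))
      cols <- c(idx[, "col"], rep(NA_integer_, length(a_only)), b_only)
      out <- cbind(a[rows, , drop = FALSE], b[cols, , drop = FALSE])
      rownames(out) <- NULL
      out
    }

    full <- contains_full_join(a, b)
    print(full)
    ```

    Matched pairs come first, followed by the unmatched rows of a and then of b.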
    
