R merge data frames, allow inexact ID matching (e.g. with additional characters, 1234 matches ab1234)

情深已故 2020-12-10 16:10

I am trying to deal with some very messy data. I need to merge two large data frames which contain different kinds of data by the sample ID. The problem is that one table's

3 answers
  • 2020-12-10 16:26

    I would clean your bIDs some more before merging. If you know all the weird ways in which the bIDs have been formatted then it should be straightforward to clean them up using gsub().

    In your example, to remove the brackets I would do something like

    expr <- '\\((.*)\\)'
    b$bID <- gsub(expr, replacement='\\1', b$bID)
    

    There are a few things going on in expr. First, .* is the regular expression for any character repeated any number of times. Wrapping it in unescaped parentheses makes it a capture group, which tells gsub that we want to keep that part and lets us refer to it as \\1 in the replacement. To match literal left and right brackets, we escape them with double backslashes. Put together, the expression reads: keep everything between a left bracket and a right bracket.

    Note that you can do fancy things with the replacement expression, such as replacement='id_\\1'.
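
    To make the clean-then-merge idea concrete, here is a minimal sketch in base R. The small tables are modelled on the IDs used elsewhere in this thread and are only an illustration; real data would likely need more cleaning rules than this single pattern.

    ```r
    # Illustrative tables (assumed shapes, not the asker's real data)
    a <- data.frame(aID = c("1234", "4567", "6789"),
                    aInfo = c("blue", "green", "goldenrod"),
                    stringsAsFactors = FALSE)
    b <- data.frame(bID = c("4567", "(1234)", "6789"),
                    bInfo = c("apple", "banana", "kiwi"),
                    stringsAsFactors = FALSE)

    # Strip one pair of surrounding brackets, keeping what is inside
    b$bID <- gsub("\\((.*)\\)", "\\1", b$bID)

    # Once the IDs are clean, an ordinary exact-match merge is enough
    merged <- merge(a, b, by.x = "aID", by.y = "bID", all = TRUE)
    ```

    With all = TRUE the merge is a full join, so rows without a partner in the other table are kept and padded with NA.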

    As for finding an ID inside a longer number sequence, you would have to try substring matching or something similar, but I don't consider that a good approach.
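
    For reference, substring matching could be sketched with grepl() as below. This is only an illustration of why the approach is fragile: the bID "43219" is made up for the example, and nothing in the data says whether a hit like this is the same sample or a coincidence.

    ```r
    aID <- c("1234", "3645", "321")
    bID <- c("(1234)", "23645", "63528973", "43219")  # "43219" is hypothetical

    # For each aID, collect every bID that contains it as a literal substring
    hits <- lapply(setNames(aID, aID),
                   function(id) bID[grepl(id, bID, fixed = TRUE)])
    ```

    Here "321" is found inside "43219" just as readily as "1234" is found inside "(1234)", which is the kind of ambiguity that makes cleaning the IDs first the safer route.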

    Hope this helps.

  • 2020-12-10 16:32

    This is an answer using data.table, inspired by @nograpes.

    library(data.table)

    ## Create example tables; I added the sarcoline cases
    ##   so there would be examples of rows in a but not b
    a <- data.table(aID=c("1234","1234","4567","6789","3645","321", "321"),
                    aInfo=c("blue","blue2","green","goldenrod","cerulean",
                            "sarcoline","sarcoline2"),
                    key="aID")
    b <- data.table(bID=c("4567","(1234)","6789","23645","63528973"),
                    bInfo=c("apple","banana","kiwi","pomegranate","lychee"),
                    key="bID")
    
    ## Use agrep to get the rows of b by each aID from a
    ab <- a[, b[agrep(aID, bID)], by=.(aID, aInfo)]
    ab
    ##     aID     aInfo    bID       bInfo
    ## 1: 1234      blue (1234)      banana
    ## 2: 1234     blue2 (1234)      banana
    ## 3: 3645  cerulean  23645 pomegranate
    ## 4: 4567     green   4567       apple
    ## 5: 6789 goldenrod   6789        kiwi
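
    A note on agrep() before going further: it does approximate matching, so by default it tolerates a small number of edits rather than requiring the ID to appear verbatim. The sketch below uses base R only and contrasts the default with max.distance = 0. The ID "1235" is made up for illustration and does not come from the question.

    ```r
    ids <- c("(1234)", "23645", "1235")   # "1235" is a hypothetical near-miss

    # Default tolerance: for a 4-character pattern one edit is allowed,
    # so the near-miss "1235" is reported alongside "(1234)"
    loose <- agrep("1234", ids, value = TRUE)

    # max.distance = 0 allows no edits: the pattern must occur verbatim
    strict <- agrep("1234", ids, max.distance = 0, value = TRUE)
    ```

    If near-misses like this can occur in your data, tightening max.distance or cleaning the IDs first may be the safer choice.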
    

    So far we've only got an inner join, so now let's add the unmatched rows from the original tables:

    ab <- rbindlist(list(ab, a[!ab[, unique(aID)]], b[!ab[, unique(bID)]]), fill=TRUE)
    

    These steps are optional and are included to match the output from the OP:

    ## Update NA values of aID with the value from bID
    ab[is.na(aID), aID:=bID]
    
    ## Drop the bID column
    ab[, bID:=NULL]
    

    Final result

    ab
    ##         aID      aInfo       bInfo
    ## 1:     1234       blue      banana
    ## 2:     1234      blue2      banana
    ## 3:     3645   cerulean pomegranate
    ## 4:     4567      green       apple
    ## 5:     6789  goldenrod        kiwi
    ## 6:      321  sarcoline          NA
    ## 7:      321 sarcoline2          NA
    ## 8: 63528973         NA      lychee
    
  • 2020-12-10 16:41

    Merging on a condition is a little tricky. I don't think you can do it with merge as it is written, so you end up writing a custom function and applying it with by. It is pretty inefficient, but then, so is merge. If you have millions of rows, consider data.table. This is how you would do an "inner join", where only rows that match are returned.

    # I slightly modified your data to test multiple matches    
    a<-data.frame(aID=c("1234","1234","4567","6789","3645"),aInfo=c("blue","blue2","green","goldenrod","cerulean"))
    b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))
    
    f<-function(x) merge(x,b[agrep(x$aID[1],b$bID),],all=TRUE)
    do.call(rbind,by(a,a$aID,f))
    
    #         aID     aInfo    bID       bInfo
    # 1234.1 1234      blue (1234)      banana
    # 1234.2 1234     blue2 (1234)      banana
    # 3645   3645  cerulean  23645 pomegranate
    # 4567   4567     green   4567       apple
    # 6789   6789 goldenrod   6789        kiwi
    

    Doing a full join is a little trickier. This is one way, though it is still inefficient:

    f<-function(x,b) {
      matches<-b[agrep(x[1,1],b[,1]),]
      if (nrow(matches)>0) merge(x,matches,all=TRUE)
      # Ugly... but how else to create a data.frame full of NAs?
      else merge(x,b[NA,][1,],all.x=TRUE)
    }
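    # Left part: every row of a, with its fuzzy matches from b (or NAs)
    d<-do.call(rbind,by(a,a$aID,f,b))
    # Rows of b that no aID matched
    left.over<-!(b$bID %in% d$bID)
    # Append them, padded with NAs. Note: the literal 'bID' index puts all
    # left-over rows in one group, so this line appears to rely on exactly
    # one row of b being left over, as is the case here.
    rbind(d,do.call(rbind,by(b[left.over,],'bID',f,a))[names(d)])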
    
    #         aID     aInfo      bID       bInfo
    # 1234.1 1234      blue   (1234)      banana
    # 1234.2 1234     blue2   (1234)      banana
    # 3645   3645  cerulean    23645 pomegranate
    # 4567   4567     green     4567       apple
    # 6789   6789 goldenrod     6789        kiwi
    # bID    <NA>      <NA> 63528973      lychee
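
    As an alternative sketch (not the code above, just one possible restructuring in base R), the same fuzzy full join can be built from index vectors, which avoids the by/merge round trips. The helper names hits, ia and ib are made up for this example.

    ```r
    a <- data.frame(aID = c("1234", "1234", "4567", "6789", "3645"),
                    aInfo = c("blue", "blue2", "green", "goldenrod", "cerulean"),
                    stringsAsFactors = FALSE)
    b <- data.frame(bID = c("4567", "(1234)", "6789", "23645", "63528973"),
                    bInfo = c("apple", "banana", "kiwi", "pomegranate", "lychee"),
                    stringsAsFactors = FALSE)

    # For every row of a, the row numbers of b whose bID approximately matches
    hits <- lapply(a$aID, function(id) agrep(id, b$bID))

    # One (row of a, row of b) pair per match: this is the inner join
    ia <- rep(seq_len(nrow(a)), lengths(hits))
    ib <- unlist(hits)

    # Rows without a partner get an NA index on the other side
    only_a <- setdiff(seq_len(nrow(a)), ia)
    only_b <- setdiff(seq_len(nrow(b)), ib)
    ia_full <- c(ia, only_a, rep(NA, length(only_b)))
    ib_full <- c(ib, rep(NA, length(only_a)), only_b)

    # Indexing a data frame with NA yields a row of NAs, which pads the join
    full <- cbind(a[ia_full, ], b[ib_full, ])
    rownames(full) <- NULL
    ```

    This still calls agrep once per row of a, so it inherits the same matching behaviour and is unlikely to scale to millions of rows any better than the original.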
    