R: merge data frames, allowing inexact ID matching (e.g. with additional characters, 1234 matches ab1234)

情深已故 2020-12-10 16:10

I am trying to deal with some very messy data. I need to merge two large data frames which contain different kinds of data by the sample ID. The problem is that one table's

3 Answers
  •  忘掉有多难
    2020-12-10 16:32

    This is an answer using data.table, inspired by @nograpes.

    library(data.table)
    ## Create example tables; I added the sarcoline cases
    ##   so there would be examples of rows in a but not b
    a <- data.table(aID=c("1234","1234","4567","6789","3645","321", "321"),
                    aInfo=c("blue","blue2","green","goldenrod","cerulean",
                            "sarcoline","sarcoline2"),
                    key="aID")
    b <- data.table(bID=c("4567","(1234)","6789","23645","63528973"),
                    bInfo=c("apple","banana","kiwi","pomegranate","lychee"),
                    key="bID")
    
    ## For each row of a, use agrep (approximate matching) to find the rows of b whose bID matches aID
    ab <- a[, b[agrep(aID, bID)], by=.(aID, aInfo)]
    ab
    ##     aID     aInfo    bID       bInfo
    ## 1: 1234      blue (1234)      banana
    ## 2: 1234     blue2 (1234)      banana
    ## 3: 3645  cerulean  23645 pomegranate
    ## 4: 4567     green   4567       apple
    ## 5: 6789 goldenrod   6789        kiwi
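
    One caveat about the matching step: with its default `max.distance = 0.1`, `agrep` tolerates roughly one edit for a four-character pattern, so it can also match IDs that merely look similar. The sketch below uses hypothetical IDs (not from the question) to contrast that default with a stricter variant based on `grepl(..., fixed = TRUE)`, which only accepts a `bID` that contains the `aID` verbatim. Which behaviour is appropriate depends on how the messy IDs were produced, so treat this as an illustration rather than a recommendation.

    ```r
    library(data.table)

    ## Hypothetical IDs for illustration only
    a <- data.table(aID = c("1234", "4567"), key = "aID")
    b <- data.table(bID = c("(1234)", "ab1234", "1235", "4567"), key = "bID")

    ## Default agrep tolerance: "1235" is one substitution away from "1234",
    ## so it is accepted as a match
    fuzzy <- a[, b[agrep(aID, bID)], by = aID]

    ## Stricter variant: aID must appear verbatim somewhere inside bID
    strict <- a[, b[grepl(aID, bID, fixed = TRUE)], by = aID]

    fuzzy
    strict
    ```

    If extra characters around the ID are the only kind of noise expected, the stricter form avoids pairing two samples whose IDs differ by a single digit.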
    

    So far we've only got an inner join, so now let's add the unmatched rows from the original tables:

    ## Not-joins on the keys of a and b: keep only the rows whose ID never appeared in ab
    ab <- rbindlist(list(ab, a[!ab[, unique(aID)]], b[!ab[, unique(bID)]]), fill=TRUE)
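
    The `a[!ab[, unique(aID)]]` form is a not-join: it passes a character vector as `i`, which only works because `a` is keyed by `aID`. As a small sketch on cut-down example data, here is the same selection written with `%in%`, which should behave the same whether or not a key is set:

    ```r
    library(data.table)

    ## Cut-down versions of the tables above
    a  <- data.table(aID = c("1234", "321"), aInfo = c("blue", "sarcoline"), key = "aID")
    ab <- data.table(aID = "1234", aInfo = "blue", bID = "(1234)", bInfo = "banana")

    ## Keyed not-join, as used in the answer
    via_key <- a[!ab[, unique(aID)]]

    ## Key-independent equivalent: drop rows whose aID already appears in ab
    via_in <- a[!aID %in% ab$aID]

    via_key
    via_in
    ```

    Both calls should return only the unmatched `321` row.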
    

    These steps are optional and are included to match the output from the OP:

    ## Update NA values of aID with the value from bID
    ab[is.na(aID), aID:=bID]
    
    ## Drop the bID column
    ab[, bID:=NULL]
    

    Final result:

    ab
    ##         aID      aInfo       bInfo
    ## 1:     1234       blue      banana
    ## 2:     1234      blue2      banana
    ## 3:     3645   cerulean pomegranate
    ## 4:     4567      green       apple
    ## 5:     6789  goldenrod        kiwi
    ## 6:      321  sarcoline          NA
    ## 7:      321 sarcoline2          NA
    ## 8: 63528973         NA      lychee
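
    For reuse, the steps above could be wrapped in a function. This is only a sketch: `fuzzy_full_join` is a hypothetical name, and the function assumes the column names `aID`, `aInfo`, `bID` and `bInfo` from the example, so it would need adapting to real data.

    ```r
    library(data.table)

    ## Hypothetical helper; hard-codes the example's column names
    fuzzy_full_join <- function(a, b) {
      ## Inner part: rows of b whose bID approximately matches each aID
      ab <- a[, b[agrep(aID, bID)], by = .(aID, aInfo)]
      ## Append the rows of a and b that never matched
      ab <- rbindlist(list(ab, a[!aID %in% ab$aID], b[!bID %in% ab$bID]),
                      fill = TRUE)
      ## Fill missing aID from bID, then drop bID
      ab[is.na(aID), aID := bID]
      ab[, bID := NULL]
      ab[]
    }

    a <- data.table(aID = c("1234", "1234", "4567", "6789", "3645", "321", "321"),
                    aInfo = c("blue", "blue2", "green", "goldenrod", "cerulean",
                              "sarcoline", "sarcoline2"),
                    key = "aID")
    b <- data.table(bID = c("4567", "(1234)", "6789", "23645", "63528973"),
                    bInfo = c("apple", "banana", "kiwi", "pomegranate", "lychee"),
                    key = "bID")

    res <- fuzzy_full_join(a, b)
    res
    ```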
    
