R: merge data frames, allowing inexact ID matching (e.g. with additional characters, 1234 matches ab1234)


情深已故 2020-12-10 16:10

I am trying to deal with some very messy data. I need to merge two large data frames which contain different kinds of data by the sample ID. The problem is that one table's

3 answers

忘掉有多难 (original poster)

2020-12-10 16:32

This is an answer using data.table, inspired by @nograpes.

library(data.table)

## Create example tables; I added the sarcoline cases
##   so there would be examples of rows in a but not b
a <- data.table(aID=c("1234","1234","4567","6789","3645","321", "321"),
                aInfo=c("blue","blue2","green","goldenrod","cerulean",
                        "sarcoline","sarcoline2"),
                key="aID")
b <- data.table(bID=c("4567","(1234)","6789","23645","63528973"),
                bInfo=c("apple","banana","kiwi","pomegranate","lychee"),
                key="bID")

## Use agrep to get the rows of b by each aID from a
ab <- a[, b[agrep(aID, bID)], by=.(aID, aInfo)]
ab
##     aID     aInfo    bID       bInfo
## 1: 1234      blue (1234)      banana
## 2: 1234     blue2 (1234)      banana
## 3: 3645  cerulean  23645 pomegranate
## 4: 4567     green   4567       apple
## 5: 6789 goldenrod   6789        kiwi
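
One thing to keep in mind with agrep is that it is approximate, not just "ID plus extra characters". As far as I understand the default max.distance = 0.1, a 4-character pattern is allowed one edit, so an ID that differs by a single digit can also pass. The sketch below uses made-up IDs (the `messy` vector is hypothetical, not from the question) to compare agrep with a plain substring test via grepl(fixed = TRUE), which may be the safer choice if the IDs are only ever wrapped in extra characters:

```r
# Hypothetical messy IDs; "1284" is a different sample that is one digit off.
messy <- c("(1234)", "ab1234", "1284", "4567")

# agrep() allows edits: with the default max.distance = 0.1, a 4-character
# pattern may differ by one insertion, deletion or substitution.
fuzzy <- agrep("1234", messy, value = TRUE)

# grepl(fixed = TRUE) only accepts the ID as an exact substring.
strict <- messy[grepl("1234", messy, fixed = TRUE)]

print(fuzzy)   # candidates accepted by approximate matching
print(strict)  # candidates that literally contain "1234"
```

If the stricter behaviour is what you want, the join above could use b[grepl(aID, bID, fixed = TRUE)] in place of b[agrep(aID, bID)]; that swap is my suggestion, not part of the original approach.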

So far we've only got an inner join, so now let's add the unmatched rows from the original tables:

ab <- rbindlist(list(ab, a[!ab[, unique(aID)]], b[!ab[, unique(bID)]]), fill=TRUE)
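
The line above relies on two data.table idioms: a not-join on a keyed table, and rbindlist with fill=TRUE. Here is a minimal sketch of both in isolation, using small made-up tables (`dt`, `matched` and `other` are hypothetical names for illustration only):

```r
library(data.table)

# Small keyed table (hypothetical data) to show the not-join idiom on its own.
dt <- data.table(id = c("a", "b", "c"), val = 1:3, key = "id")
matched <- c("a", "c")

# On a keyed table, a character vector in `i` is joined against the key,
# and the leading `!` keeps only the rows whose key is NOT in the vector.
unmatched <- dt[!matched]

# rbindlist(fill = TRUE) stacks tables with different columns, padding with NA.
other <- data.table(id = "z", extra = "only here")
stacked <- rbindlist(list(unmatched, other), fill = TRUE)
print(stacked)
```

This is why both a and b are created with key= in the example tables: without a key, the character vector in `i` has nothing to join against.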

These steps are optional and are included to match the output from the OP:

## Update NA values of aID with the value from bID
ab[is.na(aID), aID:=bID]

## Drop the bID column
ab[, bID:=NULL]

Final result:

ab
##         aID      aInfo       bInfo
## 1:     1234       blue      banana
## 2:     1234      blue2      banana
## 3:     3645   cerulean pomegranate
## 4:     4567      green       apple
## 5:     6789  goldenrod        kiwi
## 6:      321  sarcoline          NA
## 7:      321 sarcoline2          NA
## 8: 63528973         NA      lychee
