I am trying to deal with some very messy data. I need to merge two large data frames which contain different kinds of data by the sample ID. The problem is that one table's
This is an answer using data.table, inspired by @nograpes.
library(data.table)
## Create example tables; I added the sarcoline cases
## so there would be examples of rows in a but not b
a <- data.table(aID=c("1234","1234","4567","6789","3645","321", "321"),
                aInfo=c("blue","blue2","green","goldenrod","cerulean",
                        "sarcoline","sarcoline2"),
                key="aID")
b <- data.table(bID=c("4567","(1234)","6789","23645","63528973"),
                bInfo=c("apple","banana","kiwi","pomegranate","lychee"),
                key="bID")
## For each (aID, aInfo) pair in a, use agrep to fuzzy-match aID against bID
## and pull the matching rows of b
ab <- a[, b[agrep(aID, bID)], by=.(aID, aInfo)]
ab
##     aID     aInfo    bID       bInfo
## 1: 1234      blue (1234)      banana
## 2: 1234     blue2 (1234)      banana
## 3: 3645  cerulean  23645 pomegranate
## 4: 4567     green   4567       apple
## 5: 6789 goldenrod   6789        kiwi
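To see what the agrep step does for a single ID, here is a minimal sketch on a hypothetical vector of IDs (the name ids and the max.distance = 0 setting are illustrative and not part of the code above). agrep looks for an approximate match of the pattern anywhere inside each string, which is why "1234" is found inside "(1234)":

```r
## Hypothetical IDs, only for illustration
ids <- c("4567", "(1234)", "6789")

## Positions of the elements that approximately contain "1234"
hit <- agrep("1234", ids)
hit

## value = TRUE returns the matching strings instead of their positions
agrep("1234", ids, value = TRUE)

## max.distance = 0 allows no edits, so this is plain substring matching
strict <- agrep("1234", ids, max.distance = 0)
strict
```

Because the default tolerance scales with the pattern length, it is probably worth checking fuzzy pairs such as 3645 with 23645 against what you expect, and tightening max.distance if they are not wanted.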
So far this is only an inner join, so now add the unmatched rows from both original tables to make it a full outer join:
ab <- rbindlist(list(ab, a[!ab[, unique(aID)]], b[!ab[, unique(bID)]]), fill=TRUE)
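The a[!...] and b[!...] pieces are not-joins: on a keyed data.table, a character vector in i is looked up against the key, and the leading ! keeps the rows that were not found. A minimal sketch on a hypothetical table (dt and its columns are made up for illustration):

```r
library(data.table)

## Hypothetical keyed table, only for illustration
dt <- data.table(id = c("a", "b", "c"), val = 1:3, key = "id")

## Not-join on the key: keep rows whose id is NOT "a" or "c"
rest <- dt[!c("a", "c")]
rest

## The same anti-join with an explicit on=, which does not rely on the key
rest_on <- dt[!.(c("a", "c")), on = "id"]
rest_on
```

The keyed form is shorter, but it silently depends on the key= arguments used when the tables were created; the on= form may be the safer choice if the tables come from elsewhere.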
These steps are optional and are included to match the output from the OP:
## Update NA values of aID with the value from bID
ab[is.na(aID), aID:=bID]
## Drop the bID column
ab[, bID:=NULL]
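Both optional steps use :=, which modifies the table in place rather than returning a copy. A minimal sketch on a hypothetical table (dt, x and y are made up for illustration):

```r
library(data.table)

## Hypothetical table, only for illustration
dt <- data.table(x = c("a", NA), y = c("p", "q"))

## Fill missing x from y, only in the rows where x is NA
dt[is.na(x), x := y]

## Remove column y by reference
dt[, y := NULL]

dt
```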
Final result:
ab
##         aID      aInfo       bInfo
## 1:     1234       blue      banana
## 2:     1234      blue2      banana
## 3:     3645   cerulean pomegranate
## 4:     4567      green       apple
## 5:     6789  goldenrod        kiwi
## 6:      321  sarcoline          NA
## 7:      321 sarcoline2          NA
## 8: 63528973         NA      lychee