I am trying to deal with some very messy data. I need to merge two large data frames, which contain different kinds of data, by the sample ID. The problem is that one table's sample IDs are formatted inconsistently: some are wrapped in brackets (e.g. "(1234)") and others are buried inside longer number sequences (e.g. "23645"), so they don't match the IDs in the other table directly.
I would clean your bIDs some more before merging. If you know all the weird ways in which the bIDs have been formatted, then it should be straightforward to clean them up using gsub().
In your example, to remove the brackets I would do something like:
expr <- '\\((.*)\\)'
b$bID <- gsub(expr, replacement='\\1', b$bID)
There are a few things going on in expr. First there is .*, which is regex for any character, any number of times. Wrapping this in an unescaped pair of brackets turns it into a capture group, which tells gsub() that we want to keep it and lets us refer to it in the replacement expression as \\1. To match the literal left and right brackets as actual characters we need to escape them with double backslashes. Putting all this together, the expression reads as: keep everything between a left bracket and a right bracket.
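For instance, on a few of the bID values that appear in the example tables elsewhere in this thread:
gsub('\\((.*)\\)', '\\1', c("(1234)", "4567", "23645"))
## [1] "1234"  "4567"  "23645"
The brackets are stripped from "(1234)", while values without brackets are left untouched.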
Note that you can do fancy things with your replacement expression, such as replacement='id_\\1'.
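For example, to prefix the cleaned IDs:
gsub(expr, replacement='id_\\1', "(1234)")
## [1] "id_1234"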
As for finding an ID within a longer number sequence, you would have to try substring matching or something similar, but I don't consider that a good approach.
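Putting it together, here is a minimal sketch of the clean-then-merge approach, borrowing the example a and b tables used in the other answers. Note that this only fixes the bracket formatting; it will not recover an ID buried inside a longer number such as "23645" (that is what the agrep()-based answers handle).
a <- data.frame(aID=c("1234","1234","4567","6789","3645"),
                aInfo=c("blue","blue2","green","goldenrod","cerulean"))
b <- data.frame(bID=c("4567","(1234)","6789","23645","63528973"),
                bInfo=c("apple","banana","kiwi","pomegranate","lychee"))
## Strip the brackets, then merge on the cleaned IDs; all=TRUE keeps unmatched rows
b$bID <- gsub('\\((.*)\\)', '\\1', b$bID)
merge(a, b, by.x="aID", by.y="bID", all=TRUE)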
Hope this helps.
This is an answer using data.table, inspired by @nograpes.
library(data.table)

## Create example tables; I added the sarcoline cases
## so there would be examples of rows in a but not b
a <- data.table(aID=c("1234","1234","4567","6789","3645","321", "321"),
aInfo=c("blue","blue2","green","goldenrod","cerulean",
"sarcoline","sarcoline2"),
key="aID")
b <- data.table(bID=c("4567","(1234)","6789","23645","63528973"),
bInfo=c("apple","banana","kiwi","pomegranate","lychee"),
key="bID")
## Use agrep to get the rows of b by each aID from a
ab <- a[, b[agrep(aID, bID)], by=.(aID, aInfo)]
ab
## aID aInfo bID bInfo
## 1: 1234 blue (1234) banana
## 2: 1234 blue2 (1234) banana
## 3: 3645 cerulean 23645 pomegranate
## 4: 4567 green 4567 apple
## 5: 6789 goldenrod 6789 kiwi
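A note on how the matching works: agrep() looks for the pattern as an approximate substring of each bID, which is why "3645" picks up "23645" above. The join uses agrep()'s default fuzziness (max.distance = 0.1); if you are worried about accidental matches you can pass max.distance explicitly inside the join, e.g. b[agrep(aID, bID, max.distance=0)]. Even with no edits allowed, exact substrings still match:
agrep("3645", c("23645", "63528973"), max.distance=0)
## [1] 1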
So far we've only got an inner join, so now let's add the unmatched rows from the original tables:
ab <- rbindlist(list(ab, a[!ab[, unique(aID)]], b[!ab[, unique(bID)]]), fill=TRUE)
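The two ! subsets are keyed anti-joins: passing a character vector prefixed with ! to a keyed data.table returns the rows whose key values are not in that vector, i.e. the rows that didn't take part in the join. You can inspect the pieces on their own:
a[!ab[, unique(aID)]]   ## the two sarcoline rows of a (aID "321")
b[!ab[, unique(bID)]]   ## the lychee row of b (bID "63528973")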
These steps are optional and are included to match the output from the OP:
## Update NA values of aID with the value from bID
ab[is.na(aID), aID:=bID]
## Drop the bID column
ab[, bID:=NULL]
Final result:
ab
## aID aInfo bInfo
## 1: 1234 blue banana
## 2: 1234 blue2 banana
## 3: 3645 cerulean pomegranate
## 4: 4567 green apple
## 5: 6789 goldenrod kiwi
## 6: 321 sarcoline NA
## 7: 321 sarcoline2 NA
## 8: 63528973 NA lychee
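Finally, if stripping the formatting (here, the brackets) turns out to be enough to make the IDs match exactly, a plain equi-join is simpler and much faster than the agrep() route. Here is a sketch reusing the gsub() pattern from the other answer; note it would not catch an ID buried inside a longer number such as "23645":
## Clean bID in place, then do an ordinary join on the cleaned IDs
b[, bID := gsub('\\((.*)\\)', '\\1', bID)]
merge(a, b, by.x="aID", by.y="bID", all=TRUE)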
Doing merge() on a condition is a little tricky. I don't think you can do it with merge() as it is written, so you end up having to write a custom function with by(). It is pretty inefficient, but then, so is merge(). If you have millions of rows, consider data.table. This is how you would do an "inner join", where only rows that match are returned.
# I slightly modified your data to test multiple matches
a <- data.frame(aID=c("1234","1234","4567","6789","3645"),
                aInfo=c("blue","blue2","green","goldenrod","cerulean"))
b <- data.frame(bID=c("4567","(1234)","6789","23645","63528973"),
                bInfo=c("apple","banana","kiwi","pomegranate","lychee"))
f <- function(x) merge(x, b[agrep(x$aID[1], b$bID), ], all=TRUE)
do.call(rbind, by(a, a$aID, f))
# aID aInfo bID bInfo
# 1234.1 1234 blue (1234) banana
# 1234.2 1234 blue2 (1234) banana
# 3645 3645 cerulean 23645 pomegranate
# 4567 4567 green 4567 apple
# 6789 6789 goldenrod 6789 kiwi
Doing a full join is a little trickier. Here is one way, though it is still inefficient:
f <- function(x, b) {
  matches <- b[agrep(x[1,1], b[,1]), ]
  if (nrow(matches) > 0) merge(x, matches, all=TRUE)
  # Ugly... but how else to create a data.frame full of NAs?
  else merge(x, b[NA,][1,], all.x=TRUE)
}
d <- do.call(rbind, by(a, a$aID, f, b))
left.over <- !(b$bID %in% d$bID)
# Group the leftover rows of b by their actual bID values so f runs once per ID
rbind(d, do.call(rbind, by(b[left.over,], b$bID[left.over], f, a))[names(d)])
# aID aInfo bID bInfo
# 1234.1 1234 blue (1234) banana
# 1234.2 1234 blue2 (1234) banana
# 3645 3645 cerulean 23645 pomegranate
# 4567 4567 green 4567 apple
# 6789 6789 goldenrod 6789 kiwi
# 63528973 <NA> <NA> 63528973 lychee