I have several rows of data (tab separated). I want to find the row which matches elements from two columns (3rd & 4th) in each row with two other colum
The function compare below takes advantage of R's fast sorting. The arguments a and b are matrices; each row of a is screened for a matching row in b, over any number of columns. If column order is irrelevant, set row_order=TRUE to sort the entries within each row in increasing order before comparing. I guess the function should also work with data frames and with character or factor columns, as well as with duplicate entries in a and/or b. Despite using for and while loops, it is relatively quick at returning the first matching row in b for each row of a (or 0 if no match is found).
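The sorting step the function relies on can be looked at in isolation. The following is only a minimal sketch on a made-up matrix m (not data from the question): it sorts rows lexicographically with do.call(order, ...) and then uses order(ord) as the inverse permutation to get back to the original row order.

```r
# Made-up 4 x 2 matrix, used only for illustration
m <- matrix(c(3, 1,
              1, 2,
              3, 0,
              1, 1), ncol = 2, byrow = TRUE)

# Sort rows lexicographically: by column 1, ties broken by column 2
ord <- do.call(order, as.data.frame(m))
m_sorted <- m[ord, , drop = FALSE]

# order(ord) is the inverse permutation: it maps anything computed on the
# sorted rows back to the original row order
restored <- m_sorted[order(ord), , drop = FALSE]
```

Here ord is 4 2 3 1, and restored is identical to m, which is why results collected on the sorted rows can be returned in the caller's original order.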
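compare <- function(a, b, row_order = TRUE) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  len1 <- nrow(a)
  len2 <- nrow(b)
  dims <- ncol(a)
  if (row_order) {
    # sort the entries within each row so that column order is ignored
    a <- matrix(t(apply(a, 1, sort)), ncol = dims)
    b <- matrix(t(apply(b, 1, sort)), ncol = dims)
  }
  # sort the rows of both matrices lexicographically (column 1 first, then column 2, ...)
  ord1 <- do.call(order, as.data.frame(a))
  ord2 <- do.call(order, as.data.frame(b))
  a <- a[ord1, , drop = FALSE]
  b <- b[ord2, , drop = FALSE]
  found <- rep(0, len1)
  at <- 1
  for (i in seq_len(len1)) {
    # advance in b while row `at` of b sorts before row i of a
    while (at <= len2) {
      cmp <- 0
      for (m in seq_len(dims)) {
        if (b[at, m] < a[i, m]) { cmp <- -1; break }
        if (b[at, m] > a[i, m]) { cmp <- 1; break }
      }
      if (cmp >= 0) break
      at <- at + 1
    }
    if (at > len2) break # b is exhausted, no further matches are possible
    if (cmp == 0) found[i] <- ord2[at] # all columns equal: record the row index in the original b
  }
  return(found[order(ord1)]) # for each row of a: first matching row in b, or zero if there is none
}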
# example data sets:
a <- matrix(sample.int(1E4, size = 1E4, replace = TRUE), ncol = 4)
b <- matrix(sample.int(1E4, size = 1E4, replace = TRUE), ncol = 4)
b <- rbind(a,b) # example of b containing a
# run the function
found<-compare(a,b,row_order=TRUE)
# check
all(found>0)
# rows in a not contained in b (none in this example):
a[found==0,]
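For a quick independent cross-check on small inputs, rows can also be collapsed into string keys and looked up with match(). This is only a sketch under stated assumptions: match_rows_by_key, a_small and b_small are hypothetical names that are not part of the answer above, and the separator "\r" is assumed not to occur in the data. It builds one string per row, so it is likely slower and more memory-hungry than the sort-and-scan approach on large inputs.

```r
# Hypothetical helper (not part of the original answer): match rows of `a`
# against rows of `b` by collapsing each row into a single string key.
match_rows_by_key <- function(a, b, row_order = TRUE) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  make_key <- function(m) {
    apply(m, 1, function(r) {
      if (row_order) r <- sort(r)  # ignore column order within a row
      paste(r, collapse = "\r")    # separator assumed absent from the data
    })
  }
  # match() returns the first matching position in b, or `nomatch` if none
  match(make_key(a), make_key(b), nomatch = 0L)
}

# Made-up toy data, used only for illustration
a_small <- matrix(c(1, 5,
                    2, 1,
                    7, 7), ncol = 2, byrow = TRUE)
b_small <- matrix(c(1, 3,
                    1, 2,
                    3, 5,
                    5, 1), ncol = 2, byrow = TRUE)

key_found <- match_rows_by_key(a_small, b_small)
key_found  # 4 2 0
```

With row_order = TRUE, row 1 of a_small (1, 5) matches row 4 of b_small (5, 1), row 2 matches row 2, and row 3 has no match. With row_order = FALSE none of the three rows match, because column order then matters.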