I have several rows of data (tab separated). I want to find the row which matches elements from two columns (3rd & 4th) in each row with two other colum
You have not indicated what you would consider a correct answer, and your terminology is a bit vague when you talk about "where there is a reciprocal match". If I understand the task correctly as finding all pairs of rows where col.3 of one row equals col.10 of the other and col.4 equals col.11, then this should accomplish the task:
which( outer(indat$V4, indat$V11, "==") &
outer(indat$V3, indat$V10, "=="),
arr.ind=TRUE)
# result
row col
[1,] 19 1
[2,] 10 3
[3,] 7 6
[4,] 8 6
[5,] 6 7
[6,] 11 8
[7,] 3 10
[8,] 7 11
[9,] 8 11
[10,] 1 19
The outer function applies a function FUN (in this case "==") to all pairwise combinations of its first and second arguments, x and y. Each call therefore returns an n x n matrix with logical entries, and I am taking the elementwise logical 'and' of two such matrices. So the rows where there are matches with other rows are:
unique( c(which( outer(indat$V4, indat$V11, "==") &
outer(indat$V3, indat$V10, "=="),
arr.ind=TRUE) ))
#[1] 19 10 7 8 6 11 3 1
So the set with no matches, assuming a data.frame named indat, is:
matches <- unique( c(which( outer(indat$V4, indat$V11, "==") &
outer(indat$V3, indat$V10, "=="), arr.ind=TRUE) ))
indat[ ! 1:NROW(indat) %in% matches, ]
And the ones with matches are:
indat[ 1:NROW(indat) %in% matches, ]
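To make the steps above concrete, here is a minimal sketch on a made-up data frame. The name toy and its values are assumptions for illustration only; just the column names mirror indat.

```r
# Toy data (hypothetical values); column names mirror those used for indat
toy <- data.frame(
  V3  = c("a", "b", "c", "d"),
  V4  = c("x", "y", "z", "w"),
  V10 = c("b", "a", "q", "q"),
  V11 = c("y", "x", "q", "q"),
  stringsAsFactors = FALSE
)

# n x n logical matrix: entry [i, j] is TRUE when (V3, V4) of row i
# equals (V10, V11) of row j
hits <- outer(toy$V3, toy$V10, "==") & outer(toy$V4, toy$V11, "==")

pairs   <- which(hits, arr.ind = TRUE)  # one line per matching (row, col) pair
matches <- unique(c(pairs))             # every row index involved in a match

toy[seq_len(nrow(toy)) %in% matches, ]   # rows taking part in a match
toy[!seq_len(nrow(toy)) %in% matches, ]  # rows without any match
```

Here rows 1 and 2 match each other, while rows 3 and 4 end up in the "no match" set.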
The function compare below takes advantage of R's capability for fast sorting. The arguments a and b are matrices; rows in a are screened for matching rows in b, for any number of columns. If column order is irrelevant, set row_order=TRUE to have the entries of each row sorted in increasing order. I guess the function should also work with data frames and character/factor columns, as well as with duplicate entries in a and/or b. Despite using for and while, it is relatively quick at returning the first row match in b for each row of a (or 0 if no match is found).
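compare <- function(a, b, row_order = TRUE) {
  len1 <- nrow(a)
  len2 <- nrow(b)
  if (row_order) {
    # sort the entries within each row so that column order is ignored
    a <- t(apply(a, 1, sort))
    b <- t(apply(b, 1, sort))
  }
  # sort the rows of a and b lexicographically
  ord1 <- do.call(order, as.data.frame(a))
  ord2 <- do.call(order, as.data.frame(b))
  a <- a[ord1, , drop = FALSE]
  b <- b[ord2, , drop = FALSE]
  # lexicographic comparison of two rows: -1 if x < y, 0 if equal, 1 if x > y
  cmp <- function(x, y) {
    d <- which(x != y)
    if (length(d) == 0) return(0)
    if (x[d[1]] < y[d[1]]) -1 else 1
  }
  found <- rep(0, len1)
  at <- 1
  for (i in seq_len(len1)) {
    # advance through sorted b while its current row is smaller than a[i, ]
    while (at <= len2 && cmp(b[at, ], a[i, ]) < 0) at <- at + 1
    if (at > len2) break
    # record the position of the matching row in the original (unsorted) b
    if (cmp(b[at, ], a[i, ]) == 0) found[i] <- ord2[at]
  }
  found[order(ord1)]  # for each row of a: first matching row of b, or 0 if none
}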
# example data sets:
a <- matrix(sample.int(1E4, size = 1E4, replace = TRUE), ncol = 4)
b <- matrix(sample.int(1E4, size = 1E4, replace = TRUE), ncol = 4)
b <- rbind(a,b) # example of b containing a
# run the function
found<-compare(a,b,row_order=TRUE)
# check
all(found>0)
# rows in a not contained in b (none in this example):
a[found==0,]
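The two sorting steps the function relies on can be looked at in isolation. This is only a sketch on made-up matrices (m and x are hypothetical names): first sorting within rows, which is what row_order=TRUE does, then ordering whole rows lexicographically.

```r
# Sorting within rows: both rows hold the same values in a different column order
m <- matrix(c(3, 1, 2,
              2, 3, 1), nrow = 2, byrow = TRUE)
sorted <- t(apply(m, 1, sort))
identical(sorted[1, ], sorted[2, ])  # the rows compare as equal after sorting

# Ordering rows lexicographically: first by column 1, ties broken by column 2
x <- matrix(c(2, 1,
              1, 5,
              1, 3), ncol = 2, byrow = TRUE)
ord <- do.call(order, as.data.frame(x))
x[ord, ]
```

Once both matrices are ordered this way, matching rows can be found in a single pass through each of them instead of comparing every row of a with every row of b.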
DWin's answer is solid; however, with large-ish inputs (typically over 50k rows or so) you'll run into memory issues, because the n x n matrices that outer creates are huge.
I'd do something like:
match(
  interaction(indat$V3, indat$V4),
  interaction(indat$V10, indat$V11)
)
This concatenates the values of interest into factors and does a match.
This is a less pure solution, but faster/more manageable.
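As a hedged variation on the same idea, the keys can also be built with paste instead of interaction. The sketch below uses a made-up data frame (toy is a hypothetical name) and assumes a tab is a safe separator, since the data are tab separated and a field therefore cannot contain one.

```r
# Toy data (hypothetical values); column names mirror those used for indat
toy <- data.frame(
  V3  = c("a", "b", "c"),
  V4  = c("x", "y", "z"),
  V10 = c("b", "a", "q"),
  V11 = c("y", "x", "q"),
  stringsAsFactors = FALSE
)

# One key per row for each column pair; the tab separator is an assumption
key_3_4   <- paste(toy$V3,  toy$V4,  sep = "\t")
key_10_11 <- paste(toy$V10, toy$V11, sep = "\t")

# For each row i: index of the first row j whose (V10, V11) equals
# (V3, V4) of row i, or NA if there is none
idx <- match(key_3_4, key_10_11)
idx

toy[!is.na(idx), ]  # rows whose (V3, V4) pair occurs somewhere as (V10, V11)
```

Unlike the outer approach, match reports only the first matching row for each key, so this fits when one match per row is enough.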