match two columns with two other columns


I have several rows of data (tab separated). I want to find the row which matches elements from two columns (3rd & 4th) in each row with two other colum


3 Answers

南旧

2020-12-10 14:29
You have not indicated what you would consider a correct answer, and your terminology is a bit vague when you talk about "where there is a reciprocal match". If I understand the task correctly as finding all rows where col.3 == col.10 & col.4 == col.11, then this should accomplish the task:
```
which( outer(indat$V4, indat$V11, "==") & 
       outer(indat$V3, indat$V10, "=="), 
       arr.ind=TRUE)
# result
      row col
 [1,]  19   1
 [2,]  10   3
 [3,]   7   6
 [4,]   8   6
 [5,]   6   7
 [6,]  11   8
 [7,]   3  10
 [8,]   7  11
 [9,]   8  11
[10,]   1  19
```
The outer function applies a function 'FUN', in this case "==", to all two-way combinations of x and y, its first and second arguments. Each call therefore returns an n x n logical matrix whose [i, j] entry compares element i of x with element j of y, and I am taking the logical 'and' of two such matrices. So the rows where there are matches with other rows are:
```
unique( c(which( outer(indat$V4, indat$V11, "==") & 
                 outer(indat$V3, indat$V10, "=="), 
                 arr.ind=TRUE) ))

#[1] 19 10  7  8  6 11  3  1
```
So the set with no matches, assuming a data.frame named indat, is:
```
matches <- unique( c(which( outer(indat$V4, indat$V11, "==") & 
                      outer(indat$V3, indat$V10, "=="), arr.ind=TRUE) ))
indat[ ! 1:NROW(indat) %in% matches, ]
```
And the ones with matches are:
```
indat[ 1:NROW(indat) %in% matches, ]
```
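To make the outer() step concrete, here is a minimal sketch on a made-up four-row data frame. The name `toy` and all of its values are assumptions for illustration only; they are not the asker's data.

```r
# Hypothetical toy data: only the four columns used in the comparison.
toy <- data.frame(
  V3  = c("a", "b", "c", "d"),
  V4  = c("x", "y", "z", "w"),
  V10 = c("b", "a", "q", "q"),
  V11 = c("y", "x", "q", "q"),
  stringsAsFactors = FALSE
)

# hits[i, j] is TRUE when (V3, V4) of row i equals (V10, V11) of row j.
hits <- outer(toy$V3, toy$V10, "==") & outer(toy$V4, toy$V11, "==")

# One line per matching (row, col) pair.
pairs <- which(hits, arr.ind = TRUE)
print(pairs)

# Rows that take part in any match, on either side of the comparison.
matched <- sort(unique(c(pairs)))

# Logical indexing stays correct even when 'matched' is empty.
is_matched <- seq_len(nrow(toy)) %in% matched
print(toy[is_matched, ])
print(toy[!is_matched, ])
```

In this toy case rows 1 and 2 match each other and rows 3 and 4 match nothing, so `matched` holds 1 and 2.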

死守一世寂寞

2020-12-10 14:51

The function compare below takes advantage of R's capability for fast sorting. Its arguments a and b are matrices; each row of a is screened for a matching row in b, across any number of columns. If column order is irrelevant, set row_order=TRUE to sort the entries within each row in increasing order. I expect the function also works with data frames and with character/factor columns, as well as with duplicate entries in a and/or b. Despite using for and while loops, it is relatively quick at returning the first row match in b for each row of a (or 0 if no match is found).

```
compare<-function(a,b,row_order=TRUE){

    len1<-dim(a)[1]
    len2<-dim(b)[1]
    if(row_order){
        a<-t(apply(t(a), 2, sort))
        b<-t(apply(t(b), 2, sort))
    }
    ord1<-do.call(order, as.data.frame(a))
    ord2<-do.call(order, as.data.frame(b))
    a<-a[ord1,]
    b<-b[ord2,]
    found<-rep(0,len1)
    dims<-dim(a)[2]
    do_dims<-c(1:dim(a)[2])
    at<-1
    for(i in 1:len1){
        for(m in do_dims){
            while(b[at,m]<a[i,m]){
                at<-(at+1)
                if(at>len2){break}
            }
            if(at>len2){break}
            if(b[at,m]>a[i,m]){break}
            if(m==dims){found[i]<-at}
        }
        if(at>len2){break}
    }
    return(found[order(ord1)]) # indicates the first match of a found in b and zero otherwise

}


# example data sets:
a <- matrix(sample.int(1E4,size = 1E4, replace = TRUE), ncol = 4)
b <- matrix(sample.int(1E4,size = 1E4, replace = TRUE), ncol = 4)
b <- rbind(a,b) # example of b containing a


# run the function
found<-compare(a,b,row_order=TRUE)
# check
all(found>0)
# rows in a not contained in b (none in this example):
a[found==0,]
```


春和景丽

2020-12-10 14:52
DWin's answer is solid. However, with large-ish inputs (typically over 50k rows or so) you'll run into memory issues, because each outer() call builds an n x n matrix.

I'd do something like:
```
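# Pair V3 with V4 and V10 with V11, so that a hit means
# V3 == V10 and V4 == V11, as in the outer() answer.
match(
  interaction( indat$V3, indat$V4),
  interaction( indat$V10, indat$V11)
)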
```
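This combines the values of interest into factors and matches one against the other. Note that match() returns only the first matching position for each row, or NA if there is none.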

This is a less pure solution, but faster/more manageable.
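
As a variation on the same idea, here is a minimal sketch that builds plain string keys with paste() instead of factors, which avoids enumerating factor level combinations. The data frame `toy`, its values, and the `"\r"` separator are all assumptions for illustration; the separator only needs to be a string that never occurs in the data.

```r
# Hypothetical toy data (values invented for illustration).
toy <- data.frame(
  V3  = c("a", "b", "c", "d"),
  V4  = c("x", "y", "z", "w"),
  V10 = c("b", "a", "q", "q"),
  V11 = c("y", "x", "q", "q"),
  stringsAsFactors = FALSE
)

# One key per row for each column pair. The separator is an arbitrary
# choice, assumed not to appear in any of the values.
key_left  <- paste(toy$V3,  toy$V4,  sep = "\r")
key_right <- paste(toy$V10, toy$V11, sep = "\r")

# For each row i: index of the first row j whose (V10, V11) equals
# row i's (V3, V4), or NA when there is no such row.
first_hit <- match(key_left, key_right)
print(first_hit)

# TRUE for rows whose (V3, V4) pair appears somewhere in (V10, V11).
has_match <- key_left %in% key_right
print(toy[has_match, ])
```

Memory use grows linearly with the number of rows here, rather than quadratically as with the n x n matrices. The trade-off is that only the first match per row is reported, not every matching pair.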