match two columns with two other columns

既然无缘 2020-12-10 13:47

I have several rows of data (tab separated). I want to find the row which matches elements from two columns (3rd & 4th) in each row with two other colum

3 Answers
  • 2020-12-10 14:29

    You have not indicated what you would consider a correct answer, and the phrase "where there is a reciprocal match" is a bit vague. If I understand the task correctly as finding all pairs of rows where col.3 of one row equals col.10 of the other and col.4 equals col.11, then this should accomplish the task:

    which( outer(indat$V4, indat$V11, "==") & 
           outer(indat$V3, indat$V10, "=="), 
           arr.ind=TRUE)
    # result
          row col
     [1,]  19   1
     [2,]  10   3
     [3,]   7   6
     [4,]   8   6
     [5,]   6   7
     [6,]  11   8
     [7,]   3  10
     [8,]   7  11
     [9,]   8  11
    [10,]   1  19
    

    The outer function applies a function FUN, in this case "==", to every two-way combination of its first and second arguments, x and y. Each call therefore returns an n x n matrix with logical entries, and I take the element-wise logical 'and' of the two matrices. In the result above, `row` is the row that supplies V3 and V4, and `col` is the row that supplies V10 and V11. So the rows that take part in a match with some other row are:

    unique( c(which( outer(indat$V4, indat$V11, "==") & 
    outer(indat$V3, indat$V10, "=="), 
    arr.ind=TRUE) ))
    
    #[1] 19 10  7  8  6 11  3  1
    

    So the set with no matches, assuming a data.frame named indat, is:

    matches <- unique( c(which( outer(indat$V4, indat$V11, "==") & 
                          outer(indat$V3, indat$V10, "=="), arr.ind=TRUE) ))
    indat[ ! 1:NROW(indat) %in% matches, ]
    

    And the ones with matches are:

    indat[ 1:NROW(indat) %in% matches, ]
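
    The same approach can be tried on a tiny made-up table. This is only a sketch: the data frame toy and its values are invented for illustration, and only the column names V3, V4, V10 and V11 follow the code above.

    ```r
    # Made-up data: row 1's (V3, V4) equals row 3's (V10, V11),
    # and row 2's (V3, V4) equals row 1's (V10, V11).
    toy <- data.frame(
        V3  = c("a", "b", "c"),
        V4  = c(1, 2, 3),
        V10 = c("b", "x", "a"),
        V11 = c(2, 9, 1),
        stringsAsFactors = FALSE
    )

    # Entry [i, j] is TRUE when row i's (V3, V4) equals row j's (V10, V11).
    hit_matrix <- outer(toy$V3, toy$V10, "==") & outer(toy$V4, toy$V11, "==")

    # One line per matching pair: "row" is i, "col" is j.
    hits <- which(hit_matrix, arr.ind = TRUE)
    print(hits)

    # Rows that appear on either side of at least one match.
    matches <- unique(c(hits))
    print(toy[seq_len(nrow(toy)) %in% matches, ])
    ```

    Because unique(c(hits)) pools the row and col indices, a row counts as matched whether its (V3, V4) pair or its (V10, V11) pair took part in the match.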
    
  • 2020-12-10 14:51

    The function compare below takes advantage of R's fast sorting. Its arguments a and b are matrices; each row of a is screened for a matching row in b, across any number of columns. If column order is irrelevant, set row_order=TRUE to sort the entries within each row in increasing order before comparing. I guess the function should also work with data frames and with character or factor columns, as well as with duplicate entries in a and/or b. Despite using for and while loops, it is relatively quick at returning the first row match in b for each row of a (or 0 if no match is found).

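    compare <- function(a, b, row_order = TRUE) {
        a <- as.matrix(a)
        b <- as.matrix(b)
        len1 <- nrow(a)
        len2 <- nrow(b)
        if (row_order) {
            # sort the entries within each row so that column order is ignored
            a <- t(apply(a, 1, sort))
            b <- t(apply(b, 1, sort))
        }
        # sort the rows of both matrices lexicographically
        ord1 <- do.call(order, as.data.frame(a))
        ord2 <- do.call(order, as.data.frame(b))
        a <- a[ord1, , drop = FALSE]
        b <- b[ord2, , drop = FALSE]
        found <- rep(0, len1)
        # lexicographic comparison of two rows: returns -1, 0 or 1
        cmp <- function(x, y) {
            d <- which(x != y)
            if (length(d) == 0) 0 else if (x[d[1]] < y[d[1]]) -1 else 1
        }
        at <- 1
        for (i in seq_len(len1)) {
            # advance through sorted b while its current row is smaller than a[i, ]
            while (at <= len2 && cmp(b[at, ], a[i, ]) < 0) {
                at <- at + 1
            }
            if (at > len2) break
            # whole rows must be equal; record the row's index in the original b
            if (cmp(b[at, ], a[i, ]) == 0) found[i] <- ord2[at]
        }
        # first matching row of b for each row of a (in a's original order), 0 otherwise
        return(found[order(ord1)])
    }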
    
    
    # example data sets:
    a <- matrix(sample.int(1E4, size = 1E4, replace = TRUE), ncol = 4)
    b <- matrix(sample.int(1E4, size = 1E4, replace = TRUE), ncol = 4)
    b <- rbind(a,b) # example of b containing a
    
    
    # run the function
    found<-compare(a,b,row_order=TRUE)
    # check
    all(found>0) 
    # rows in a not contained in b (none in this example):
    a[found==0,]
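
    A function like compare can be sanity-checked on small inputs against a slow but obvious reference. The helper first_match_slow below is hypothetical (it is not part of the answer) and the two small matrices are made up; it simply scans b row by row, so it is only suitable for tiny test cases.

    ```r
    # Hypothetical reference helper: for each row of a, return the index of the
    # first identical row of b, or 0 if there is none. Quadratic time.
    first_match_slow <- function(a, b) {
        vapply(seq_len(nrow(a)), function(i) {
            hit <- which(apply(b, 1, function(r) all(r == a[i, ])))
            if (length(hit) > 0) hit[1] else 0L
        }, integer(1))
    }

    # Made-up test case: (2, 5) shares its first value with one row of b and
    # its second value with another, but equals neither; (9, 9) is row 3 of b.
    small_a <- rbind(c(2, 5), c(9, 9))
    small_b <- rbind(c(2, 1), c(3, 5), c(9, 9))
    expected <- first_match_slow(small_a, small_b)
    print(expected)
    ```

    With row_order=FALSE, the output of compare(small_a, small_b, row_order=FALSE) should agree with expected on which rows are 0 and which are not.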
    
  • 2020-12-10 14:52

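    DWin's answer (the outer-based one above) is solid. However, with large-ish inputs, typically over 50k rows or so, you'll run into memory issues, because the matrices you're creating are huge: each outer call builds an n x n logical matrix, which at 4 bytes per element is about 10 GB for n = 50,000.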

    I'd do something like:

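    match(
      interaction(indat$V3, indat$V4),
      interaction(indat$V10, indat$V11)
    )
    

    This combines each pair of values of interest into a single factor and matches the (V3, V4) keys against the (V10, V11) keys. The result holds, for each row, the index of the first row whose V10 and V11 equal its V3 and V4, or NA if there is none.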

    This is a less pure solution, but faster/more manageable.
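
    As a rough illustration on made-up data, the keyed lookup can be sketched with paste in place of interaction. That is a substitution, not the call used above: paste builds plain character keys and avoids creating a factor level for every possible combination. The data frame toy, its values and the tab separator are assumptions; a tab is used because the question's data are tab separated, so it should not occur inside a field.

    ```r
    # Made-up data: row 1's (V3, V4) equals row 3's (V10, V11),
    # and row 2's (V3, V4) equals row 1's (V10, V11).
    toy <- data.frame(
        V3  = c("a", "b", "c"),
        V4  = c(1, 2, 3),
        V10 = c("b", "x", "a"),
        V11 = c(2, 9, 1),
        stringsAsFactors = FALSE
    )

    # One character key per row for each pair of columns.
    key_src <- paste(toy$V3,  toy$V4,  sep = "\t")
    key_tgt <- paste(toy$V10, toy$V11, sep = "\t")

    # For each row, index of the first row whose (V10, V11) key equals
    # this row's (V3, V4) key; NA when there is no such row.
    idx <- match(key_src, key_tgt)
    print(idx)

    # Rows whose (V3, V4) pair is found somewhere in (V10, V11).
    print(toy[!is.na(idx), ])
    ```

    Memory use here grows with the number of rows rather than with its square, which is the point of preferring a lookup over the full n x n comparison.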
