Matched Range Merge in R

甜味超标 2020-12-22 03:55

I would like to merge/combine two files, so that if an entry in column B of my first file falls into the range of columns B and C in my second file, the output will contain

3 Answers
  •  被撕碎了的回忆
    2020-12-22 04:23

    I see you've already accepted an answer, but here is another possible solution.

    This function was hacked together quickly; with some more work it could be made more general.

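    myfun = function(DATA1, DATA2, MATCH1, MIN, MAX) {
      # temp[i, j] is TRUE when row i of DATA1 lies within the range in row j of DATA2
      temp = sapply(1:nrow(DATA2),
                    function(x) DATA1[[MATCH1]] >= DATA2[[MIN]][x] &
                      DATA1[[MATCH1]] <= DATA2[[MAX]][x])
      # Keep only the rows of DATA1 that fall within at least one range.
      # Logical indexing keeps DATA1 intact when every row has a match.
      temp1 = DATA1[rowSums(temp) > 0, ]
      # Sort both inputs and bind them side by side
      OUT = cbind(temp1[order(temp1[[MATCH1]]), ],
                  DATA2[order(DATA2[[MIN]]), ], row.names=NULL)
      # Columns 2, 4, and 5 are hard-coded: this assumes DATA1 has two columns
      # and that the match, min, and max columns are laid out as in the sample data
      condition = ((OUT[4] <= OUT[2] & OUT[2] <= OUT[5]) == 0)
      if (isTRUE(any(condition))) {
        OUT[-which(condition), ]
      } else {
        OUT
      }
    }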
    

    Here's what the function does:

    1. For each row of the first data.frame, it tests whether the value in the match column (here, the second column) lies between the values in the "min" and "max" columns (here, the second and third columns) of each row of the second data.frame.
    2. It then removes from the first data.frame any row whose value falls within none of those ranges.
    3. Next, it sorts the first data.frame by the match column and the second data.frame by the "min" column, and binds the two side by side.
    4. Finally, it checks once more that each value from the first data.frame lies between the min and max values on the same row; rows that fail the check are removed.
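
    To make steps 1 and 2 concrete, here is a minimal sketch of the intermediate logical matrix, built from a copy of the sample data in the question. The names d1, d2, and in_range are made up for this illustration and are not part of the function above.

```r
# Hypothetical names (d1, d2, in_range) used only for this illustration
d1 = data.frame(A = c("rs10", "rs100", "rs234"),
                B = c(23353, 10000, 54440))
d2 = data.frame(A = c("E235", "E255"),
                B = c(20000, 50000),
                C = c(30000, 60000))

# One row per entry in d1, one column per range in d2:
# in_range[i, j] is TRUE when d1$B[i] lies within [d2$B[j], d2$C[j]]
in_range = sapply(seq_len(nrow(d2)),
                  function(j) d1$B >= d2$B[j] & d1$B <= d2$C[j])

# A row sum of 0 means the entry matched no range and will be dropped
rowSums(in_range)
# [1] 1 0 1

# Only rs10 and rs234 survive the filter
d1[rowSums(in_range) > 0, ]
```

    Note that sapply only returns a matrix here because d1 has more than one row; with a single-row input it would simplify to a vector and rowSums would fail.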

    Now, here is some sample data. A and B are the same as your provided data. X is A with the rows reordered and one value changed, and Y is B with the upper bound of the second range lowered, for further testing purposes. In the merge between X and Y, there should be only one row.

    A = read.table(header=TRUE, text="A      B
        rs10    23353
        rs100   10000
        rs234   54440")
    
    B = read.table(header=TRUE, text="A        B      C
        E235    20000   30000
        E255    50000   60000")
    
    X = A[c(3, 1, 2), ]
    X[1, 2] = 57000
    Y = B
    Y[2, 3] = 55000
    

    Here's how you would use the function and the output you would get.

    myfun(A, B, 2, 2, 3)
    #       A     B    A     B     C
    # 1  rs10 23353 E235 20000 30000
    # 2 rs234 54440 E255 50000 60000
    myfun(X, Y, 2, 2, 3)
    #      A     B    A     B     C
    # 1 rs10 23353 E235 20000 30000
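
    For comparison, a more general way to get the same result is to build every pairing of rows and keep only the pairs where the value falls inside the range. This is only a sketch, not the function above: range_merge and its argument names are made up here, and it assumes the key and range columns are numeric. Because merge() with by = NULL returns nrow(d1) * nrow(d2) rows, it suits small inputs only.

```r
# Hypothetical helper: cross join, then filter on the range condition
range_merge = function(d1, d2, key, lo, hi) {
  # Rename columns so the two inputs cannot clash after the join
  names(d1) = paste0(names(d1), ".1")
  names(d2) = paste0(names(d2), ".2")
  # by = NULL makes merge() return the Cartesian product of the rows
  all_pairs = merge(d1, d2, by = NULL)
  k = all_pairs[[paste0(key, ".1")]]
  keep = k >= all_pairs[[paste0(lo, ".2")]] &
    k <= all_pairs[[paste0(hi, ".2")]]
  out = all_pairs[keep, ]
  rownames(out) = NULL
  out
}

# Same sample data as in the question
A = data.frame(A = c("rs10", "rs100", "rs234"),
               B = c(23353, 10000, 54440))
B = data.frame(A = c("E235", "E255"),
               B = c(20000, 50000),
               C = c(30000, 60000))

res = range_merge(A, B, key = "B", lo = "B", hi = "C")
res
```

    Unlike myfun, this version does not depend on column positions or on the two inputs lining up after sorting, and an entry that falls into several overlapping ranges appears once per matching range.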
    
