All-to-all setdiff on two numeric vectors with a numeric threshold for accepting matches

Asked by 被撕碎了的回忆 on 2020-12-20 23:49

What I want to do is more or less a combination of the problems discussed in the following two threads:

  • Perform non-pairwise all-to-all comparisons between two vectors
3 Answers
  • 2020-12-21 00:10

    If you are happy to use a non-base package, data.table::inrange is a convenient function.

    library(data.table)

    b_1[!inrange(b_1, b_2 - 0.045, b_2 + 0.045)]
    # [1] 1002.570  301.569

    b_2[!inrange(b_2, b_1 - 0.045, b_1 + 0.045)]
    # [1]   22.12   53.00 5666.31  100.10
    
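    To unpack what inrange does here: it tests, for each element of its first argument, whether that value falls inside any of the intervals [lower[i], upper[i]]. A toy illustration with made-up numbers:

    library(data.table)
    x <- c(1.00, 2.00, 3.00)
    y <- c(1.01, 2.90)
    inrange(x, y - 0.045, y + 0.045)  # is each x within 0.045 of some y?
    # [1]  TRUE FALSE FALSE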

    inrange is also efficient on larger data sets. On vectors of length 1e5, for example, inrange is more than 700 times faster than the two other alternatives:

    n <- 1e5
    b1 <- runif(n, 0, 10000)
    b2 <- b1 + runif(n, -1, 1)
    
    library(microbenchmark)

    # f1: the chunked outer() function f from the answer below
    # f2: the sapply() approach; f3: inrange()
    microbenchmark(
      f1 = f(b1, b2, 0.045, 5000),
      f2 = list(in_b1_not_in_b2 = b1[sapply(b1, function(x) !any(abs(x - b2) <= 0.045))],
                in_b2_not_in_b1 = b2[sapply(b2, function(x) !any(abs(x - b1) <= 0.045))]),
      f3 = list(in_b1_not_in_b2 = b1[!inrange(b1, b2 - 0.045, b2 + 0.045)],
                in_b2_not_in_b1 = b2[!inrange(b2, b1 - 0.045, b1 + 0.045)]),
      unit = "relative", times = 10)
    # Unit: relative
    #  expr      min       lq     mean   median        uq       max neval
    #    f1 1976.931 1481.324 1269.393 1103.567 1173.3017 1060.2435    10
    #    f2 1347.114 1027.682  858.908  766.773  754.7606  700.0702    10
    #    f3    1.000    1.000    1.000    1.000    1.0000    1.0000    10
    

    And yes, they give the same result:

    n <- 100
    b1 <- runif(n, 0, 10000)
    b2 <- b1 + runif(n, -1, 1)
    
    all.equal(f(b1, b2, 0.045, 5000),
              list(in_b1_not_in_b2 = b1[sapply(b1, function(x) !any(abs(x - b2) <= 0.045))],
                   in_b2_not_in_b1 = b2[sapply(b2, function(x) !any(abs(x - b1) <= 0.045))]))
    # TRUE
    
    all.equal(f(b1, b2, 0.045, 5000),
              list(in_b1_not_in_b2 = b1[!inrange(b1, b2 - 0.045, b2 + 0.045)],
                   in_b2_not_in_b1 = b2[!inrange(b2, b1 - 0.045, b1 + 0.045)]))
    # TRUE
    

    Searching SO for inrange turns up several related, potentially useful answers.

  • 2020-12-21 00:13

    Here is an alternative approach using sapply:

    ## keep the values of b_1 that are not within 0.045 of any value of b_2
    in_b1_not_in_b2 <- b_1[sapply(b_1, function(x) !any(abs(x - b_2) <= 0.045))]
    in_b1_not_in_b2
    #[1] 1002.570  301.569

    ## and vice versa
    in_b2_not_in_b1 <- b_2[sapply(b_2, function(x) !any(abs(x - b_1) <= 0.045))]
    in_b2_not_in_b1
    #[1]   22.12   53.00 5666.31  100.10
    
  • 2020-12-21 00:18

    A vectorized beast:

    ## D[i, j] is TRUE when b_1[i] and b_2[j] differ by more than the threshold
    D <- abs(outer(b_1, b_2, "-")) > 0.045

    ## b_1[i] has no match in b_2 if its entire row of D is TRUE
    in_b1_not_in_b2 <- b_1[rowSums(D) == length(b_2)]
    #[1] 1002.570  301.569

    ## likewise for b_2[j] and its column
    in_b2_not_in_b1 <- b_2[colSums(D) == length(b_1)]
    #[1]   22.12   53.00 5666.31  100.10
    

    hours later...

    Henrik shared a question complaining about the memory explosion when using outer for long vectors: Matching two very very large vectors with tolerance (fast! but working space sparing). However, the memory bottleneck of outer is easily eliminated by blocking.

    f <- function (b1, b2, threshold, chunk.size = 5000) {

      n1 <- length(b1)
      n2 <- length(b2)
      chunk.size <- min(chunk.size, n1, n2)

      RS <- numeric(n1)  ## rowSums of the full comparison matrix, accumulated
      CS <- numeric(n2)  ## colSums of the full comparison matrix, accumulated

      ## walk through b2, then b1, in chunks of at most chunk.size elements
      j <- 0
      while (j < n2) {
        chunk.size_j <- min(chunk.size, n2 - j)
        ind_j <- (j + 1):(j + chunk.size_j)
        b2_j <- b2[ind_j]
        i <- 0
        while (i < n1) {
          chunk.size_i <- min(chunk.size, n1 - i)
          ind_i <- (i + 1):(i + chunk.size_i)
          ## M[p, q] is TRUE when the pair differs by more than threshold
          M <- abs(outer(b1[ind_i], b2_j, "-")) > threshold
          RS[ind_i] <- RS[ind_i] + rowSums(M)
          CS[ind_j] <- CS[ind_j] + colSums(M)
          i <- i + chunk.size_i
        }
        j <- j + chunk.size_j
      }

      ## an element is unmatched when it exceeds the threshold against
      ## every element of the other vector
      list(in_b1_not_in_b2 = b1[RS == n2], in_b2_not_in_b1 = b2[CS == n1])
    }
    

    With this function, outer never uses more memory than storing two chunk.size x chunk.size matrices. Now let's do something crazy.

    b1 <- runif(1e+5, 0, 10000)
    b2 <- b1 + runif(1e+5, -1, 1)
    

    If we did a plain outer call, we would need memory to store two 1e+5 x 1e+5 matrices, which amounts to roughly 149 GB. With the chunked function, however, the computation is feasible even on my Sandy Bridge (2011) laptop with only 4 GB RAM.
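    For intuition, the back-of-the-envelope arithmetic behind those figures (my own numbers, assuming 8-byte doubles; the "two matrices" are presumably outer's result plus the temporary copy made by abs):

    1e5 * 1e5 * 8 * 2 / 2^30    # plain outer(): two full matrices, ~149 GiB
    5000 * 5000 * 8 * 2 / 2^20  # inside f(): two chunk-sized matrices, ~381 MiB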

    system.time(oo <- f(b1, b2, 0.045, 5000))
    #   user  system elapsed 
    #365.800 167.348 533.912 
    

    The performance is actually good enough, given that we have been using a very poor algorithm.

    All answers here do an exhaustive search, which has complexity length(b1) x length(b2). We could reduce this to length(b1) + length(b2) if we worked on sorted arrays (see the sketch below), but such a deeply optimized algorithm would have to be implemented in a compiled language to be truly efficient.
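    To make the sorted-array idea concrete, here is a minimal R sketch of a two-pointer scan (the helper setdiff_tol_sorted is my own name, not code from any answer). After the O(n log n) sorts, the scan itself is linear; unlike the answers above, it returns the unmatched values in sorted order.

    setdiff_tol_sorted <- function(b1, b2, threshold) {
      s1 <- sort(b1)
      s2 <- sort(b2)
      n2 <- length(s2)
      unmatched <- logical(length(s1))
      j <- 1
      for (i in seq_along(s1)) {
        ## skip every s2 value too small to match s1[i]; because s1 is
        ## sorted, j never moves backwards, so the scan is linear overall
        while (j <= n2 && s2[j] < s1[i] - threshold) j <- j + 1
        ## s1[i] is unmatched if the next candidate is already too large
        unmatched[i] <- j > n2 || s2[j] > s1[i] + threshold
      }
      s1[unmatched]
    }

    ## both directions, mirroring the other answers
    in_b1_not_in_b2 <- setdiff_tol_sorted(b1, b2, 0.045)
    in_b2_not_in_b1 <- setdiff_tol_sorted(b2, b1, 0.045)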
