More efficient strategy for which() or match()

后端 未结 3 1676
情话喂你
情话喂你 2020-12-31 23:30

I have a vector of positive and negative numbers

vec<-c(seq(-100,-1), rep(0,20), seq(1,100))

the vector is larger than the example, and

3条回答
  •  误落风尘
    2021-01-01 00:03

    The solutions offered so far all imply creating a logical(length(vec)) and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started thinking I'd be super-clever and implement this in C for even greater speed, but had trouble with debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R:

    f3 <- function(x) {
        imin <- 1L
        imax <- length(x)
        while (imax >= imin) {
            imid <- as.integer(imin + (imax - imin) / 2)
            if (x[imid] >= 0)
                imax <- imid - 1L
            else
                imin <- imid + 1L
        }
        imax
    }
    

    For comparison with the other suggestions

    f0 <- function(v) length(which(v < 0))
    f1 <- function(v) sum(v < 0)
    f2 <- function(v) which.min(v < 0) - 1L
    

    and for fun

    library(compiler)
    f3.c <- cmpfun(f3)
    

    Leading to

    > vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
    > identical(f0(vec), f1(vec))
    [1] TRUE
    > identical(f0(vec), f2(vec))
    [1] TRUE
    > identical(f0(vec), f3(vec))
    [1] TRUE
    > identical(f0(vec), f3.c(vec))
    [1] TRUE
    > microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
    Unit: microseconds
          expr       min        lq     median         uq       max neval
       f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903   100
       f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293   100
       f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889   100
       f3(vec)    51.715    56.050    75.4495    78.5295   100.730   100
     f3.c(vec)    11.612    17.147    28.5570    31.3160    49.781   100
    

    Probably there are some tricky edge cases that I've got wrong! Moving to C, I did

    library(inline)
    f4 <- cfunction(c(x = "numeric"), "
        int imin = 0, imax = Rf_length(x) - 1, imid;
        while (imax >= imin) {
            imid = imin + (imax - imin) / 2;
            if (REAL(x)[imid] >= 0)
                imax = imid - 1;
            else
                imin = imid + 1;
        }
        return ScalarInteger(imax + 1);
    ")
    

    with

    > identical(f3(vec), f4(vec))
    [1] TRUE
    > microbenchmark(f3(vec), f3.c(vec), f4(vec))
    Unit: nanoseconds
          expr   min      lq  median      uq   max neval
       f3(vec) 52096 53192.0 54918.5 55539.0 69491   100
     f3.c(vec) 10924 12233.5 12869.0 13410.0 20038   100
       f4(vec)   553   796.0   893.5  1004.5  2908   100
    

    findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec is actually sorted and dealing with NA values. If one wants to live on the edge (arguably no worse that implementing f3 or f4) then

    f5.i <- function(v)
        .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))
    

    is nearly as fast as the C implementation, but likely more robust and vectorized (i.e., look up a vector of values in the second argument, for easy range-like calculations).

提交回复
热议问题