Fastest way to find *the index* of the second (third…) highest/lowest value in vector or column

青春惊慌失措 2020-12-17 10:04

What is the fastest way to find the index of the second (third, ...) highest/lowest value in a vector or column?

i.e. the index counterpart of the value returned by the following (with n <- length(x)):

sort(x,partial=n-1)[n-1]
         


        
7 answers
  • 2020-12-17 10:35

    The Rfast library implements an nth-element function with an option to return the index. In the benchmark below it is faster than the other implementations discussed here.

    x <- runif(1e+6)
    
    ind <- 2
    
    which_nth_highest_richie <- function(x, n)
    {
      for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
      which(x == max(x))
    }
    
    which_nth_highest_joris <- function(x, n)
    {
      ux <- unique(x)
      nux <- length(ux)
      which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
    } 
    
    microbenchmark::microbenchmark(
      Rfast  = Rfast::nth(x, ind, descending = TRUE, index.return = TRUE),
      order  = order(x, decreasing = TRUE)[ind],
      richie = which_nth_highest_richie(x, ind),
      joris  = which_nth_highest_joris(x, ind))
    
    Unit: milliseconds
              expr       min        lq      mean    median        uq      max   neval
             Rfast  22.89945  26.03551  31.61163  26.70668  32.07650 105.0016   100
             order 113.54317 116.49898 122.97939 119.44496 124.63646 170.4589   100
            richie  26.69556  27.93143  38.74055  36.16341  44.10246 116.7192   100
             joris 126.52276 138.60153 151.49343 146.55747 155.60709 324.8605   100 
    
  • 2020-12-17 10:36

    One possible route is to use the index.return argument of sort(). I'm not sure whether this is the fastest, though.

    set.seed(21)
    x <- rnorm(10)
    ind <- 2
    sapply(sort(x, index.return=TRUE), `[`, length(x)-ind+1)
    #        x       ix 
    # 1.746222 3.000000
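
    The same idea can be wrapped in a small helper. This is only a sketch: the function name nth_highest_via_sort is hypothetical, and it assumes x has no NAs and that n is at most length(x). It sorts in decreasing order so the n-th element can be read off directly.

    ```r
    # Hypothetical helper: value and index of the n-th highest element,
    # using a full sort with index.return = TRUE (assumes no NAs in x).
    nth_highest_via_sort <- function(x, n) {
      s <- sort(x, decreasing = TRUE, index.return = TRUE)
      c(value = s$x[n], index = s$ix[n])
    }

    x <- c(10, 40, 20, 50, 30)
    res <- nth_highest_via_sort(x, 2)
    res
    # value index
    #    40     2
    ```

    Because this sorts the whole vector, it is unlikely to beat the partial-sort approaches on large inputs.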
    
  • 2020-12-17 10:41

    Method: Set all max values to -Inf, then find the indices of the max. No sorting required.

    X <- runif(1e7)
    system.time(
    {
      X[X == max(X)] <- -Inf
      which(X == max(X))
    })
    

    Works with ties and is very fast.

    If you can guarantee no ties, then an even faster version is

    system.time(
    {
      X[which.max(X)] <- -Inf
      which.max(X)
    })
    

    EDIT: As Joris mentioned, this method doesn't scale that well for finding third, fourth, etc., highest values.

    which_nth_highest_richie <- function(x, n)
    {
      for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
      which(x == max(x))
    }
    
    which_nth_highest_joris <- function(x, n)
    {
      ux <- unique(x)
      nux <- length(ux)
      which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
    }
    

    Using x <- runif(1e7) and n = 2, Richie wins

    system.time(which_nth_highest_richie(x, 2))   #about half a second
    system.time(which_nth_highest_joris(x, 2))    #about 2 seconds
    

    For n = 100, Joris wins

    system.time(which_nth_highest_richie(x, 100)) #about 20 seconds, ouch! 
    system.time(which_nth_highest_joris(x, 100))  #still about 2 seconds
    

    The balance point, where they take the same length of time, is about n = 10.
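
    As a quick sanity check, the two functions can be compared on a small vector with ties. The sketch below repeats both definitions so it runs on its own; the test vector is made up for illustration, and n is assumed not to exceed the number of distinct values.

    ```r
    which_nth_highest_richie <- function(x, n)
    {
      # knock out the current maximum (all tied copies) n - 1 times
      for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
      which(x == max(x))
    }

    which_nth_highest_joris <- function(x, n)
    {
      # partial sort of the distinct values, then match back to x
      ux <- unique(x)
      nux <- length(ux)
      which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
    }

    x <- c(5, 9, 7, 9, 3, 7)   # made-up data with ties at 9 and 7

    r2 <- which_nth_highest_richie(x, 2)   # 3 6 (both positions of 7)
    j2 <- which_nth_highest_joris(x, 2)    # 3 6
    r3 <- which_nth_highest_richie(x, 3)   # 1   (position of 5)
    j3 <- which_nth_highest_joris(x, 3)    # 1
    ```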

  • 2020-12-17 10:41

    This is my solution for finding the indices of the top N highest values in a vector (not exactly what the OP wanted, but it might help other people):

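    index.top.N = function(xs, N=10){
      if(length(xs) > 0) {
        # order() sorts ascending; na.last=FALSE puts NAs first, away from the top
        o = order(xs, na.last=FALSE)
        o.length = length(o)
        if (N > o.length) N = o.length
        # the last N positions hold the indices of the N largest values
        o[((o.length-N+1):o.length)]
      }
      else {
        0
      }
    }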
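
    A short usage sketch, with made-up data, may help show what the function returns. Note that the indices come back in ascending order of value, so the largest element is last; rev() is used here only to put the largest first.

    ```r
    index.top.N = function(xs, N=10){
      if(length(xs) > 0) {
        o = order(xs, na.last=FALSE)
        o.length = length(o)
        if (N > o.length) N = o.length
        o[((o.length-N+1):o.length)]
      }
      else {
        0
      }
    }

    xs <- c(3, 8, NA, 5, 1)        # made-up data, including one NA

    top2 <- index.top.N(xs, N = 2)
    top2                           # 4 2  (values 5 and 8, smallest of the two first)
    rev(top2)                      # 2 4  (largest first)
    xs[top2[1]]                    # 5, i.e. the 2nd highest value
    ```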
    
  • 2020-12-17 10:49

    No ties: which() is probably your friend here. Combine the output from the sort() solution with which() to find the index that matches the output from the sort() step.

    > set.seed(1)
    > x <- sample(1000, 250)
    > n <- length(x)
    > sort(x,partial=n-1)[n-1]
    [1] 992
    > which(x == sort(x,partial=n-1)[n-1])
    [1] 145
    

    Ties handling: The solution above doesn't work properly (and wasn't intended to) if there are ties at the ith largest value or among any of the larger values. We need to take the unique values of the vector before sorting them, and then the above solution works:

    > set.seed(1)
    > x <- sample(1000, 1000, replace = TRUE)
    > length(unique(x))
    [1] 639
    > n <- length(x)
    > i <- which(x == sort(x,partial=n-1)[n-1])
    > sum(x > x[i])
    [1] 0
    > x.uni <- unique(x)
    > n.uni <- length(x.uni)
    > i <- which(x == sort(x.uni, partial = n.uni-1)[n.uni-1])
    > sum(x > x[i])
    [1] 2
    > tail(sort(x))
    [1]  994  996  997  997 1000 1000
    

    order() is also very useful here:

    > head(ord <- order(x, decreasing = TRUE))
    [1] 220 145 209 202 211 163
    

    So the solution here is ord[2] for the index of the 2nd highest/largest element of x.

    Some timings:

    > set.seed(1)
    > X <- sample(1e7, 1e7)
    > system.time({n <- length(X); which(X == sort(X, partial = n-1)[n-1])})
       user  system elapsed 
      0.319   0.058   0.378 
    > system.time({ord <- order(X, decreasing = TRUE); ord[2]})
       user  system elapsed 
     14.578   0.084  14.708 
    > system.time({order(X, decreasing = TRUE)[2]})
       user  system elapsed 
     14.647   0.084  14.779
    

    As the linked post suggested and the timings above show, order() is much slower, although both give the same result:

    > all.equal(which(X == sort(X, partial = n-1)[n-1]), 
    +           order(X, decreasing = TRUE)[2])
    [1] TRUE
    

    And for the ties-handling version:

    foo <- function(x, i) {
        X <- unique(x)
        N <- length(X)
        i <- i-1
        which(x == sort(X, partial = N-i)[N-i])
    }
    
    > system.time(foo(X, 2))
       user  system elapsed 
      1.249   0.176   1.454
    

    The extra steps slow this solution down a bit, but at about 1.5 seconds it is still roughly ten times faster than order().
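
    The question also asks about the lowest values, which none of the timings here cover. A minimal sketch of the mirrored approach follows, assuming the same unique-then-partial-sort idea; the name which_nth_lowest is hypothetical, and n is assumed not to exceed the number of distinct values.

    ```r
    # Hypothetical mirror of the ties-handling function for the n-th LOWEST value.
    which_nth_lowest <- function(x, n) {
      ux <- unique(x)
      stopifnot(n >= 1, n <= length(ux))
      # after a partial sort at position n, element n is the n-th smallest
      which(x == sort(ux, partial = n)[n])
    }

    x <- c(5, 9, 7, 9, 3, 7, 3)    # made-up data with ties

    low1 <- which_nth_lowest(x, 1) # 5 7 (both positions of 3)
    low2 <- which_nth_lowest(x, 2) # 1   (position of 5)
    ```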

  • 2020-12-17 10:49

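    Use the maxN function given by Zach to find the nth highest value, then use which() with arr.ind = TRUE.

    which(x == maxN(x, 4), arr.ind = TRUE)

    When x is a matrix or array, arr.ind = TRUE makes which() return the row and column position of each match; for a plain vector it returns the same indices as which() alone. It can be added to the which() calls in the other solutions in the same way.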
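
    maxN is not defined anywhere in this thread, so the sketch below uses a hypothetical stand-in built on the partial-sort idea from the other answers; it should not be read as Zach's actual code. The matrix is made up, and the example only illustrates what arr.ind = TRUE adds.

    ```r
    # Hypothetical stand-in for maxN: the N-th highest distinct value of x.
    maxN <- function(x, N = 2) {
      ux <- unique(as.vector(x))
      len <- length(ux)
      stopifnot(N >= 1, N <= len)
      sort(ux, partial = len - N + 1)[len - N + 1]
    }

    m <- matrix(c(1, 8, 3,
                  9, 5, 7), nrow = 2, byrow = TRUE)   # made-up 2 x 3 matrix

    second <- maxN(m, 2)                              # 8
    pos <- which(m == second, arr.ind = TRUE)
    pos
    #      row col
    # [1,]   1   2
    ```

    Without arr.ind = TRUE, the same call would return the single linear index 3, since R stores matrices column by column.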
