Decreasing Partial Sorting

Question

As ?sort says, if the argument partial is not NULL, it is taken to contain indices of elements of the result which are to be placed in their correct positions in the sorted array by partial sorting. See Argument “partial” of the sort function in R for details. So if I need to find the smallest 5 numbers in x <- sample(1:100, 50), then

sort(x, partial = 1:5)[1:5]

will be faster than

sort(x)[1:5]
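
As a quick sanity check, the sketch below confirms that the partially sorted result agrees with a full sort on the first five positions. It uses only base R; the seed and the variable names smallest_partial and smallest_full are arbitrary choices made here for illustration.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- sample(1:100, 50)

# with partial = 1:5, positions 1..5 hold the values a full sort would put there
smallest_partial <- sort(x, partial = 1:5)[1:5]
smallest_full    <- sort(x)[1:5]

# compare the two results
identical(smallest_partial, smallest_full)
```

Only the requested positions are guaranteed to be correct; the rest of the partially sorted vector is in no particular order, which is where the time saving comes from.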

However, how can I find the largest 5 numbers with partial sorting? Intuitively, I tried:

sort(x, partial = 1:5, decreasing = T)

but this fails with

Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) : unsupported options for partial sorting

So my question is how to get the same efficiency gain in this case.

Answer 1:

You can take the tail of the partially sorted vector:

set.seed(42)
x <- sample(1:100, 50)
# sort(x, partial = 1:5)[1:5] ## head

p <- length(x)+1 - (1:5) ## tail
sort(x, partial = p)[p]

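Because p runs from length(x) downwards, the result comes out largest first. If you want increasing order instead, you can reverse the result using rev().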
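
A small wrapper makes the index arithmetic reusable. The function name largest_k is hypothetical, introduced here only for illustration; it is a sketch that assumes x contains no NA values and that 1 <= k <= length(x).

```r
# largest_k is a hypothetical helper name, not part of base R
largest_k <- function(x, k) {
  stopifnot(k >= 1, k <= length(x), !anyNA(x))
  p <- length(x) + 1 - seq_len(k)   # positions n, n-1, ..., n-k+1
  sort(x, partial = p)[p]           # largest value first
}

set.seed(42)
x <- sample(1:100, 50)
top5 <- largest_k(x, 5)
top5        # decreasing order
rev(top5)   # increasing order
```

The NA check is there because sort() drops missing values by default, which would shift the positions that p refers to.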

Answer 2:

You might still benefit from the speed boost with something like this (assuming numeric data):

-sort(-x, partial = 1:5)[1:5]
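
The trick works because negation reverses the ordering: the five smallest values of -x are the negatives of the five largest values of x. The check below is a sketch on simulated data (seed, sample size, and variable names are arbitrary choices made here) that the result matches a full decreasing sort.

```r
set.seed(3)                          # arbitrary seed for reproducibility
x <- sample(1:100000, 1000, replace = TRUE)

# negate, partially sort the five smallest, negate back
via_negation <- -sort(-x, partial = 1:5)[1:5]
via_full     <- sort(x, decreasing = TRUE)[1:5]

# compare the two results
identical(via_negation, via_full)
```

As the answer notes, this only applies to data that can be negated, so it does not carry over to character vectors or factors.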

Benchmarking:

set.seed(3)
x <- sample(1:100000, 500000, replace = TRUE)

bench::mark(
  snoram = -sort(-x, partial = 1:5)[1:5],
  OP = sort(x, decreasing = TRUE)[1:5],
  sotos_check = x[order(x, decreasing = TRUE)][1:5],
  jogo = {p <- length(x) - 0:4; sort(x, partial = p)[p]}
)
# A tibble: 4 x 14
  expression       min     mean   median      max `itr/sec` mem_alloc  n_gc n_itr total_time result    memory             time     gc               
  <chr>       <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>    <list>             <list>   <list>           
1 snoram        6.87ms   7.77ms   7.43ms  15.04ms     129.     5.72MB     9    34      264ms <int [5]> <Rprofmem [3 x 3]> <bch:tm> <tibble [43 x 3]>
2 OP            17.4ms  18.96ms  18.56ms  24.37ms      52.7    3.81MB     3    21      398ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [24 x 3]>
3 sotos_check  14.65ms  17.07ms  16.48ms  25.58ms      58.6    3.81MB     4    23      393ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [27 x 3]>
4 jogo          4.98ms   5.45ms   5.35ms   8.91ms     184.     3.81MB     6    37      201ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [43 x 3]>

Answer 3:

You can also use C++'s std::partial_sort through Rcpp, using a file with the following content:

#include "Rcpp.h"
#include <algorithm>
using namespace Rcpp;

inline bool rev_comp(double const i, double const j){ 
  return i > j; 
}

// [[Rcpp::export(rng = false)]]
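NumericVector cpp_partial_sort(NumericVector x, unsigned const k) {
  if(k < 1 or k > x.size())
    throw std::invalid_argument("Invalid k");

  // work on a copy so the caller's vector is not modified
  NumericVector out = clone(x);
  // put the k largest values, in decreasing order, in out[0], ..., out[k - 1];
  // the range must end at out.end() so that the last element is considered too
  std::partial_sort(out.begin(), out.begin() + k, out.end(), rev_comp);
  return out;
}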

After compiling the file (for example with Rcpp::sourceCpp()), we can confirm that all approaches return the same values and run a benchmark:

# simulate data
set.seed(2)
x <- rnorm(10000)

# they all give the same
rk <- 5
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        -sort(-x, partial = 1:rk)[1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        sort(x, decreasing = TRUE)[1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        x[order(x, decreasing = TRUE)][1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        { p <- length(x) - 0:(rk - 1); sort(x, partial = p)[p] })
#R> numeric(0)

# benchmark 
microbenchmark::microbenchmark(
  cpp = cpp_partial_sort(x, rk)[1:rk], 
  snoram = -sort(-x, partial = 1:5)[1:5],
  OP = sort(x, decreasing = TRUE)[1:5],
  sotos_check = x[order(x, decreasing = TRUE)][1:5],
  jogo = {p <- length(x) - 0:4; sort(x, partial = p)[p]}, times = 1000)
#R> Unit: microseconds
#R>         expr   min    lq  mean median  uq  max neval
#R>          cpp  23.7  26.1  32.2     27  28 4384  1000
#R>       snoram 174.3 185.2 208.3    188 194 3968  1000
#R>           OP 528.6 558.4 595.9    562 574 4630  1000
#R>  sotos_check 474.9 504.4 550.7    507 519 4446  1000
#R>         jogo 172.1 182.1 194.7    186 190 3744  1000

There is a one-time compilation cost, but this can be offset if cpp_partial_sort is called many times. The solution can possibly be made more generic with a template version, as I show in another answer (linked in the original post).

Source: https://stackoverflow.com/questions/54303287/decreasing-partial-sorting

Tags

sorting