问题
As ?sort
said, if the argument partial is not NULL, it is taken to contain indices of elements of the result which are to be placed in their correct positions in the sorted array by partial sorting. You can read Argument “partial” of the sort function in R
for detail. So in the case that I need to find the smallest 5 numbers in x <- sample(1:100, 50)
, then
sort(x, partial = 1:5)[1:5]
will be faster than
sort(x)[1:5]
However, how could I find the largest 5 numbers with partial sorting? Intuitively, I try to use:
sort(x, partial = 1:5, decreasing = T)
but it gets
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) : unsupported options for partial sorting
Therefore, my question is how to achieve the effect of efficiency in this case.
回答1:
You can take the tail from the sorted vector:
set.seed(42)
x <- sample(1:100, 50)
# sort(x, partial = 1:5)[1:5] ## head
p <- length(x)+1 - (1:5) ## tail
sort(x, partial = p)[p]
If you want you can reverse the result using rev()
回答2:
You might still benefit from the speed boost with something like (assuming numeric data):
-sort(-x, partial = 1:5)[1:5]
Benchmarking:
set.seed(3)
x <- sample(1:100000, 500000, replace = TRUE)
bench::mark(
snoram = -sort(-x, partial = 1:5)[1:5],
OP = sort(x, decreasing = TRUE)[1:5],
sotos_check = x[order(x, decreasing = TRUE)][1:5],
jogo = {p <- length(x) - 0:4; sort(x, partial = p)[p]}
)
# A tibble: 4 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time result memory time gc
<chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <bch:tm> <list> <list> <list> <list>
1 snoram 6.87ms 7.77ms 7.43ms 15.04ms 129. 5.72MB 9 34 264ms <int [5]> <Rprofmem [3 x 3]> <bch:tm> <tibble [43 x 3]>
2 OP 17.4ms 18.96ms 18.56ms 24.37ms 52.7 3.81MB 3 21 398ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [24 x 3]>
3 sotos_check 14.65ms 17.07ms 16.48ms 25.58ms 58.6 3.81MB 4 23 393ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [27 x 3]>
4 jogo 4.98ms 5.45ms 5.35ms 8.91ms 184. 3.81MB 6 37 201ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [43 x 3]>
回答3:
You can also use C++'s partial_sort
through Rcpp with a file with the following content:
include "Rcpp.h"
#include <algorithm>
using namespace Rcpp;
inline bool rev_comp(double const i, double const j){
return i > j;
}
// [[Rcpp::export(rng = false)]]
NumericVector cpp_partial_sort(NumericVector x, unsigned const k) {
if(k >= x.size() or k < 1)
throw std::invalid_argument("Invalid k");
if(k + 1 == x.size())
return x;
NumericVector out = clone(x);
std::partial_sort(&out[0], &out[k + 1], &out[x.size() - 1], rev_comp);
return out;
}
We can now confirm that we get the same and make a benchmark:
# simulate data
set.seed(2)
x <- rnorm(10000)
# they all give the same
rk <- 5
setdiff(cpp_partial_sort(x, rk)[1:rk],
-sort(-x, partial = 1:rk)[1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk],
sort(x, decreasing = TRUE)[1:5])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk],
x[order(x, decreasing = TRUE)][1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk],
{ p <- length(x) - 0:(rk - 1); sort(x, partial = p)[p] })
#R> numeric(0)
# benchmark
microbenchmark::microbenchmark(
cpp = cpp_partial_sort(x, rk)[1:rk],
snoram = -sort(-x, partial = 1:5)[1:5],
OP = sort(x, decreasing = TRUE)[1:5],
sotos_check = x[order(x, decreasing = TRUE)][1:5],
jogo = {p <- length(x) - 0:4; sort(x, partial = p)[p]}, times = 1000)
#R> Unit: microseconds
#R> expr min lq mean median uq max neval
#R> cpp 23.7 26.1 32.2 27 28 4384 1000
#R> snoram 174.3 185.2 208.3 188 194 3968 1000
#R> OP 528.6 558.4 595.9 562 574 4630 1000
#R> sotos_check 474.9 504.4 550.7 507 519 4446 1000
#R> jogo 172.1 182.1 194.7 186 190 3744 1000
There is the compilation time but this can be offset if cpp_partial_sort
is called many times. The solution can possibly made more generic with a template version like I show here.
来源:https://stackoverflow.com/questions/54303287/decreasing-partial-sorting