问题
I'm trying to build a super fast mode function for R to use for aggregating large categorical datasets. The function should take vector input of all supported R types and return the mode. I have read This post, This Help-page and others, but I was not able to make the function take in all R data types. My code now works for numeric vectors, I am relying on Rcpp sugar wrapper functions:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int Mode(NumericVector x, bool narm = false)
{
if (narm) x = x[!is_na(x)];
NumericVector ux = unique(x);
int y = ux[which_max(table(match(x, ux)))];
return y;
}
In addition I was wondering if the 'narm' argument can be renamed 'na.rm' without giving errors, and of course if there is a faster way to code a mode function in C++, I would be grateful to know about it.
回答1:
In order to make the function work for any vector input, you could implement @JosephWood's algorithm for any data type you want to support and call it from a switch(TYPEOF(x))
. But that would be lots of code duplication. Instead, it is better to make a generic function that can work on any Vector<RTYPE>
argument. If we follow R's paradigm that everything is a vector and let the function also return a Vector<RTYPE>
, then we can make use of RCPP_RETURN_VECTOR. Note that we need C++11 to be able to pass additional arguments to the function called by RCPP_RETURN_VECTOR
. One tricky thing is that you need the storage type for Vector<RTYPE>
in order to create a suitable std::unordered_map
. Here Rcpp::traits::storage_type<RTYPE>::type
comes to the rescue. However, std::unordered_map
does not know how to deal with complex numbers from R. For simplicity, I am disabling this special case.
Putting it all together:
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>
template <int RTYPE>
Vector<RTYPE> fastModeImpl(Vector<RTYPE> x, bool narm){
if (narm) x = x[!is_na(x)];
int myMax = 1;
Vector<RTYPE> myMode(1);
// special case for factors == INTSXP with "class" and "levels" attribute
if (x.hasAttribute("levels")){
myMode.attr("class") = x.attr("class");
myMode.attr("levels") = x.attr("levels");
}
std::unordered_map<typename Rcpp::traits::storage_type<RTYPE>::type, int> modeMap;
modeMap.reserve(x.size());
for (std::size_t i = 0, len = x.size(); i < len; ++i) {
auto it = modeMap.find(x[i]);
if (it != modeMap.end()) {
++(it->second);
if (it->second > myMax) {
myMax = it->second;
myMode[0] = x[i];
}
} else {
modeMap.insert({x[i], 1});
}
}
return myMode;
}
template <>
Vector<CPLXSXP> fastModeImpl(Vector<CPLXSXP> x, bool narm) {
stop("Not supported SEXP type!");
}
// [[Rcpp::export]]
SEXP fastMode( SEXP x, bool narm = false ){
RCPP_RETURN_VECTOR(fastModeImpl, x, narm);
}
/*** R
set.seed(1234)
s <- sample(1e5, replace = TRUE)
fastMode(s)
fastMode(s + 0.1)
l <- sample(c(TRUE, FALSE), 11, replace = TRUE)
fastMode(l)
c <- sample(letters, 1e5, replace = TRUE)
fastMode(c)
f <- as.factor(c)
fastMode(f)
*/
Output:
> set.seed(1234)
> s <- sample(1e5, replace = TRUE)
> fastMode(s)
[1] 85433
> fastMode(s + 0.1)
[1] 85433.1
> l <- sample(c(TRUE, FALSE), 11, replace = TRUE)
> fastMode(l)
[1] TRUE
> c <- sample(letters, 1e5, replace = TRUE)
> fastMode(c)
[1] "z"
> f <- as.factor(c)
> fastMode(f)
[1] z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
As noted above, the used algorithm comes from Joseph Wood's answer, which has been explicitly dual-licensed under CC-BY-SA and GPL >= 2. I am following Joseph and hereby license the code in this answer under the GPL (version 2 or later) in addition to the implicit CC-BY-SA license.
回答2:
In your Mode
function, since you are mostly calling sugar wrapper functions, you won't see that much improvement over base R
. In fact, simply writing a faithful base R translation, we have:
baseMode <- function(x, narm = FALSE) {
if (narm) x <- x[!is.na(x)]
ux <- unique(x)
ux[which.max(table(match(x, ux)))]
}
And benchmarking, we have:
set.seed(1234)
s <- sample(1e5, replace = TRUE)
library(microbenchmark)
microbenchmark(Mode(s), baseMode(s), times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Mode(s) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
baseMode(s) 1.490765 1.645367 1.571132 1.616061 1.637181 1.448306 10
Typically, when we undertake the effort of writing our own compiled code, we would expect bigger gains. Simply wrapping these already efficient compiled functions in Rcpp
isn't going to magically get you the gains you expect. In fact, on larger examples the base solution is faster. Observe:
set.seed(1234)
sBig <- sample(1e6, replace = TRUE)
system.time(Mode(sBig))
user system elapsed
1.410 0.036 1.450
system.time(baseMode(sBig))
user system elapsed
0.915 0.025 0.943
To address your question of writing a faster mode function, we can make use of std::unordered_map
, which is very similar to table
underneath the hood (i.e. they are both hash tables at their heart). Additionally, since you are returning a single integer, we can safely assume that we can replace NumericVector
with IntegerVector
and also that you are not concerned with returning every value that occurs the most.
The algorithm below can be modified to return the true mode, but I will leave that as an exercise (hint: you will need std::vector
along with taking some sort of action when it->second == myMax
). N.B. you will also need to add // [[Rcpp::plugins(cpp11)]]
at the top of your cpp file for std::unordered_map
and auto
.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>
// [[Rcpp::export]]
int fastIntMode(IntegerVector x, bool narm = false) {
if (narm) x = x[!is_na(x)];
int myMax = 1;
int myMode = 0;
std::unordered_map<int, int> modeMap;
modeMap.reserve(x.size());
for (std::size_t i = 0, len = x.size(); i < len; ++i) {
auto it = modeMap.find(x[i]);
if (it != modeMap.end()) {
++(it->second);
if (it->second > myMax) {
myMax = it->second;
myMode = x[i];
}
} else {
modeMap.insert({x[i], 1});
}
}
return myMode;
}
And the benchmarks:
microbenchmark(Mode(s), baseMode(s), fastIntMode(s), times = 15, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Mode(s) 6.428343 6.268131 6.622914 6.134388 6.881746 7.78522 15
baseMode(s) 9.757491 9.404101 9.454857 9.169315 9.018938 10.16640 15
fastIntMode(s) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 15
Now we are talking... about 6x faster than the original and 9x faster than base. They all return the same value:
fastIntMode(s)
##[1] 85433
baseMode(s)
##[1] 85433
Mode(s)
##[1] 85433
And for our larger example:
## base R returned in 0.943s
system.time(fastIntMode(s))
user system elapsed
0.217 0.006 0.224
In addition to the implicit CC-BY-SA
license I hereby license the code in this answer under the GPL >= 2
.
来源:https://stackoverflow.com/questions/55212746/rcpp-fast-statistical-mode-function-with-vector-input-of-any-type