Why is my Rcpp implementation for finding the number of unique items slower than base R?

☆樱花仙子☆ 提交于 2019-12-04 16:53:01

First, please make examples reproducible. The above lacks Rcpp Attributes, C++11 plugin, and necessary header imports.

Second, the issue that is being displayed here is the cost of performing a deep copy of the data out of R to a C++ structure. The majority of time in your benchmark is being spent in the copy process. This process is triggered by the choice of using an std::vector<std::string> instead of an Rcpp::CharacterVector, which holds the SEXP, s-expression, or a pointer to the data. By negating the proxy model offered by Rcpp objects, which only performs shallow copies, there will be an immediate cost to data importation into C++ that is much larger than mere microseconds described within Why is this simplistic cpp function version slower?.

Having said that, let's talk how to modify the above example to use Rcpp objects. First, note that Rcpp objects have a member function called .sort() that accurately sorts an Rcpp::CharacterVector with missing values (see Rcpp FAQ Section 5.5: Sorting with STL on a CharacterVector produces problematic results for details this assumes no capitalize or special locale). Secondly, the SEXP type can be used as a way to construct the std::unordered_set even with the data being imported as an Rcpp::CharacterVector. These modifications can be found in the C++ functions with "native" in their declaration.

#include <Rcpp.h>
#include <unordered_set>
#include <algorithm>

// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
int unique_sort(std::vector<std::string> x) {
  sort(x.begin(), x.end());
  return unique(x.begin(), x.end()) - x.begin();
}

// [[Rcpp::export]]
int unique_set(std::vector<std::string> x) {
  std::unordered_set<std::string> tab(x.begin(), x.end());
  return tab.size();
}

// [[Rcpp::export]]
int unique_sort_native(Rcpp::CharacterVector x) {
  x.sort();
  return std::unique(x.begin(), x.end()) - x.begin();
}

// [[Rcpp::export]]
int unique_set_native(Rcpp::CharacterVector x) {
  std::unordered_set<SEXP> tab(x.begin(), x.end());
  return tab.size();
}

Test Code:

# install.packages(c("microbenchmark"))

# Note, it is more efficient to supply an integer rather than a vector
# in sample()'s first parameter.
x <- paste0("x", sample(1e5, 1e7, replace=T))

# Run a microbenchmark
microbenchmark::microbenchmark(
  length(unique(x)),
  length(unique.default(x)),
  unique_sort(x),
  unique_set(x),
  unique_sort_native(x),
  unique_set_native(x),
  times = 10
)

Output:

Unit: milliseconds
                      expr     min      lq    mean  median      uq     max neval
         length(unique(x))   208.0   235.3   235.7   237.2   240.2   247.4    10
 length(unique.default(x))   230.9   232.8   238.8   233.7   241.8   266.6    10
            unique_sort(x) 12759.4 12877.1 12993.8 12920.1 13043.2 13416.7    10
             unique_set(x)  2528.1  2545.3  2590.1  2590.3  2631.3  2670.1    10
     unique_sort_native(x)  7452.6  7482.4  7568.5  7509.0  7563.6  7917.8    10
      unique_set_native(x)   175.8   176.9   179.2   178.3   182.3   183.4    10

So, when avoiding a deep copy by using Rcpp objects, the unique_set_native function beats out the length(unique()) call by around 30 milliseconds.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!