Why is my Rcpp implementation for finding the number of unique items slower than base R?

I'm trying to write a function to count the number of unique items in a string vector (my problem is slightly more complex, but this is reproducible. I did this based on answers I found for C++. Here is my code:

C++

int unique_sort(vector<string> x) {
    sort(x.begin(), x.end());
    return unique(x.begin(), x.end()) - x.begin();
}

int unique_set(vector<string> x) {
    unordered_set<string> tab(x.begin(), x.end());
    return tab.size();
}

x <- paste0("x", sample(1:1e5, 1e7, replace=T))
microbenchmark(length(unique(x)),unique_sort(x), unique_set(x), times=3)

Results:

Unit: milliseconds
              expr        min         lq       mean     median         uq
 length(unique(x))   365.0213   373.4018   406.0209   381.7823   426.5206
    unique_sort(x) 10732.1918 10847.0532 10907.6882 10961.9146 10995.4363
     unique_set(x)  1948.6517  2230.3383  2334.4040  2512.0249  2527.2802

Looking at the R source code for the unique function (it is a bit hard to follow), it seems to use a loop over the array to add unique elements to a hash and checks that hash if element already exist.

Therefore, I thought it should be equivalent to the unordered_set approach. I don't understand why the unordered_set approach is 5 times slower.

TLDR: why is my C++ code slow?

First, please make examples reproducible. The above lacks Rcpp Attributes, C++11 plugin, and necessary header imports.

Second, the issue that is being displayed here is the cost of performing a deep copy of the data out of R to a C++ structure. The majority of time in your benchmark is being spent in the copy process. This process is triggered by the choice of using an std::vector<std::string> instead of an Rcpp::CharacterVector, which holds the SEXP, s-expression, or a pointer to the data. By negating the proxy model offered by Rcpp objects, which only performs shallow copies, there will be an immediate cost to data importation into C++ that is much larger than mere microseconds described within Why is this simplistic cpp function version slower?.

Having said that, let's talk how to modify the above example to use Rcpp objects. First, note that Rcpp objects have a member function called .sort() that accurately sorts an Rcpp::CharacterVector with missing values (see Rcpp FAQ Section 5.5: Sorting with STL on a CharacterVector produces problematic results for details this assumes no capitalize or special locale). Secondly, the SEXP type can be used as a way to construct the std::unordered_set even with the data being imported as an Rcpp::CharacterVector. These modifications can be found in the C++ functions with "native" in their declaration.

#include <Rcpp.h>
#include <unordered_set>
#include <algorithm>

// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
int unique_sort(std::vector<std::string> x) {
  sort(x.begin(), x.end());
  return unique(x.begin(), x.end()) - x.begin();
}

// [[Rcpp::export]]
int unique_set(std::vector<std::string> x) {
  std::unordered_set<std::string> tab(x.begin(), x.end());
  return tab.size();
}

// [[Rcpp::export]]
int unique_sort_native(Rcpp::CharacterVector x) {
  x.sort();
  return std::unique(x.begin(), x.end()) - x.begin();
}

// [[Rcpp::export]]
int unique_set_native(Rcpp::CharacterVector x) {
  std::unordered_set<SEXP> tab(x.begin(), x.end());
  return tab.size();
}

Test Code:

# install.packages(c("microbenchmark"))

# Note, it is more efficient to supply an integer rather than a vector
# in sample()'s first parameter.
x <- paste0("x", sample(1e5, 1e7, replace=T))

# Run a microbenchmark
microbenchmark::microbenchmark(
  length(unique(x)),
  length(unique.default(x)),
  unique_sort(x),
  unique_set(x),
  unique_sort_native(x),
  unique_set_native(x),
  times = 10
)

Output:

Unit: milliseconds
                      expr     min      lq    mean  median      uq     max neval
         length(unique(x))   208.0   235.3   235.7   237.2   240.2   247.4    10
 length(unique.default(x))   230.9   232.8   238.8   233.7   241.8   266.6    10
            unique_sort(x) 12759.4 12877.1 12993.8 12920.1 13043.2 13416.7    10
             unique_set(x)  2528.1  2545.3  2590.1  2590.3  2631.3  2670.1    10
     unique_sort_native(x)  7452.6  7482.4  7568.5  7509.0  7563.6  7917.8    10
      unique_set_native(x)   175.8   176.9   179.2   178.3   182.3   183.4    10

So, when avoiding a deep copy by using Rcpp objects, the unique_set_native function beats out the length(unique()) call by around 30 milliseconds.

来源：https://stackoverflow.com/questions/45024568/why-is-my-rcpp-implementation-for-finding-the-number-of-unique-items-slower-than

标签

c++

rcpp