Elementwise matrix multiplication: R versus Rcpp (How to speed this code up?)

后端未结

关注

 3  1194

北荒 2021-02-03 14:30

I am new to C++ programming (using Rcpp for seamless integration into R), and I would appreciate some advice on how to speed up some calcu

3条回答

甜味超标 (楼主)

2021-02-03 15:12

My apologies for giving an essentially C answer to a C++ question, but as has been suggested the solution generally lies in the efficient BLAS implementation of things. Unfortunately, BLAS itself lacks a Hadamard multiply so you would have to implement your own.

Here is a pure Rcpp implementation that basically calls C code. If you want to make it proper C++, the worker function can be templated but for most applications using R that isn't a concern. Note that this also operates "in-place", which means that it modifies X without copying it.

// it may be necessary on your system to uncomment one of the following
//#define restrict __restrict__ // gcc/clang
//#define restrict __restrict   // MS Visual Studio
//#define restrict              // remove it completely

#include 
using namespace Rcpp;

#include 
using std::size_t;

void hadamardMultiplyMatrixByVectorInPlace(double* restrict x,
                                           size_t numRows, size_t numCols,
                                           const double* restrict y)
{
  if (numRows == 0 || numCols == 0) return;

  for (size_t col = 0; col < numCols; ++col) {
    double* restrict x_col = x + col * numRows;

    for (size_t row = 0; row < numRows; ++row) {
      x_col[row] *= y[row];
    }
  }
}

// [[Rcpp::export]]
NumericMatrix C_matvecprod_elwise_inplace(NumericMatrix& X,
                                          const NumericVector& y)
{
  // do some dimension checking here

  hadamardMultiplyMatrixByVectorInPlace(X.begin(), X.nrow(), X.ncol(),
                                        y.begin());

  return X;
}

Here is a version that makes a copy first. I don't know Rcpp well enough to do this natively and not incur a substantial performance hit. Creating and returning a NumericMatrix(numRows, numCols) on the stack causes the code to run about 30% slower.

#include 
using namespace Rcpp;

#include 
using std::size_t;

#include 
#include 

void hadamardMultiplyMatrixByVector(const double* restrict x,
                                    size_t numRows, size_t numCols,
                                    const double* restrict y,
                                    double* restrict z)
{
  if (numRows == 0 || numCols == 0) return;

  for (size_t col = 0; col < numCols; ++col) {
    const double* restrict x_col = x + col * numRows;
    double* restrict z_col = z + col * numRows;

    for (size_t row = 0; row < numRows; ++row) {
      z_col[row] = x_col[row] * y[row];
    }
  }
}

// [[Rcpp::export]]
SEXP C_matvecprod_elwise(const NumericMatrix& X, const NumericVector& y)
{
  size_t numRows = X.nrow();
  size_t numCols = X.ncol();

  // do some dimension checking here

  SEXP Z = PROTECT(Rf_allocVector(REALSXP, (int) (numRows * numCols)));
  SEXP dimsExpr = PROTECT(Rf_allocVector(INTSXP, 2));
  int* dims = INTEGER(dimsExpr);
  dims[0] = (int) numRows;
  dims[1] = (int) numCols;
  Rf_setAttrib(Z, R_DimSymbol, dimsExpr);

  hadamardMultiplyMatrixByVector(X.begin(), X.nrow(), X.ncol(), y.begin(), REAL(Z));

  UNPROTECT(2);

  return Z;
}

If you're curious about usage of restrict, it means that you as the programmer enter a contract with the compiler that different bits of memory do not overlap, allowing the compiler to make certain optimizations. The restrict keyword is part of C++11 (and C99), but many compilers added extensions to C++ for earlier standards.

Some R code to benchmark:

require(rbenchmark)

n <- 50000
k <- 50
X <- matrix(rnorm(n*k), nrow=n)
e <- rnorm(n)

R_matvecprod_elwise <- function(mat, vec) mat*vec

all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise(X, e))
X_dup <- X + 0
all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise_inplace(X_dup, e))

benchmark(R_matvecprod_elwise(X, e),
          C_matvecprod_elwise(X, e),
          C_matvecprod_elwise_inplace(X, e),
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative", replications = 1000)

And the results:

                               test replications elapsed relative
3 C_matvecprod_elwise_inplace(X, e)         1000   3.317    1.000
2         C_matvecprod_elwise(X, e)         1000   7.174    2.163
1         R_matvecprod_elwise(X, e)         1000  10.670    3.217

Finally, the in-place version may actually be faster, as the repeated multiplications into the same matrix can cause some overflow mayhem.

Edit:

Removed the loop unrolling, as it provided no benefit and was otherwise distracting.

0 讨论(0)

查看其它3个回答