I am new to C++ programming (using Rcpp for seamless integration into R), and I would appreciate some advice on how to speed up some calcu
My apologies for giving an essentially C answer to a C++ question, but as has been suggested the solution generally lies in the efficient BLAS implementation of things. Unfortunately, BLAS itself lacks a Hadamard multiply so you would have to implement your own.
Here is a pure Rcpp implementation that basically calls C code. If you want to make it proper C++, the worker function can be templated but for most applications using R that isn't a concern. Note that this also operates "in-place", which means that it modifies X without copying it.
// it may be necessary on your system to uncomment one of the following
//#define restrict __restrict__ // gcc/clang
//#define restrict __restrict // MS Visual Studio
//#define restrict // remove it completely
#include
using namespace Rcpp;
#include
using std::size_t;
void hadamardMultiplyMatrixByVectorInPlace(double* restrict x,
size_t numRows, size_t numCols,
const double* restrict y)
{
if (numRows == 0 || numCols == 0) return;
for (size_t col = 0; col < numCols; ++col) {
double* restrict x_col = x + col * numRows;
for (size_t row = 0; row < numRows; ++row) {
x_col[row] *= y[row];
}
}
}
// [[Rcpp::export]]
NumericMatrix C_matvecprod_elwise_inplace(NumericMatrix& X,
const NumericVector& y)
{
// do some dimension checking here
hadamardMultiplyMatrixByVectorInPlace(X.begin(), X.nrow(), X.ncol(),
y.begin());
return X;
}
Here is a version that makes a copy first. I don't know Rcpp well enough to do this natively and not incur a substantial performance hit. Creating and returning a NumericMatrix(numRows, numCols) on the stack causes the code to run about 30% slower.
#include
using namespace Rcpp;
#include
using std::size_t;
#include
#include
void hadamardMultiplyMatrixByVector(const double* restrict x,
size_t numRows, size_t numCols,
const double* restrict y,
double* restrict z)
{
if (numRows == 0 || numCols == 0) return;
for (size_t col = 0; col < numCols; ++col) {
const double* restrict x_col = x + col * numRows;
double* restrict z_col = z + col * numRows;
for (size_t row = 0; row < numRows; ++row) {
z_col[row] = x_col[row] * y[row];
}
}
}
// [[Rcpp::export]]
SEXP C_matvecprod_elwise(const NumericMatrix& X, const NumericVector& y)
{
size_t numRows = X.nrow();
size_t numCols = X.ncol();
// do some dimension checking here
SEXP Z = PROTECT(Rf_allocVector(REALSXP, (int) (numRows * numCols)));
SEXP dimsExpr = PROTECT(Rf_allocVector(INTSXP, 2));
int* dims = INTEGER(dimsExpr);
dims[0] = (int) numRows;
dims[1] = (int) numCols;
Rf_setAttrib(Z, R_DimSymbol, dimsExpr);
hadamardMultiplyMatrixByVector(X.begin(), X.nrow(), X.ncol(), y.begin(), REAL(Z));
UNPROTECT(2);
return Z;
}
If you're curious about usage of restrict, it means that you as the programmer enter a contract with the compiler that different bits of memory do not overlap, allowing the compiler to make certain optimizations. The restrict keyword is part of C++11 (and C99), but many compilers added extensions to C++ for earlier standards.
Some R code to benchmark:
require(rbenchmark)
n <- 50000
k <- 50
X <- matrix(rnorm(n*k), nrow=n)
e <- rnorm(n)
R_matvecprod_elwise <- function(mat, vec) mat*vec
all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise(X, e))
X_dup <- X + 0
all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise_inplace(X_dup, e))
benchmark(R_matvecprod_elwise(X, e),
C_matvecprod_elwise(X, e),
C_matvecprod_elwise_inplace(X, e),
columns = c("test", "replications", "elapsed", "relative"),
order = "relative", replications = 1000)
And the results:
test replications elapsed relative
3 C_matvecprod_elwise_inplace(X, e) 1000 3.317 1.000
2 C_matvecprod_elwise(X, e) 1000 7.174 2.163
1 R_matvecprod_elwise(X, e) 1000 10.670 3.217
Finally, the in-place version may actually be faster, as the repeated multiplications into the same matrix can cause some overflow mayhem.
Edit:
Removed the loop unrolling, as it provided no benefit and was otherwise distracting.