Elementwise matrix multiplication: R versus Rcpp (How to speed this code up?)

北荒 2021-02-03 14:30

I am new to C++ programming (using Rcpp for seamless integration into R), and I would appreciate some advice on how to speed up some calcu

3 answers
  •  甜味超标
    2021-02-03 15:12

    My apologies for giving an essentially C answer to a C++ question, but as has been suggested the solution generally lies in the efficient BLAS implementation of things. Unfortunately, BLAS itself lacks a Hadamard multiply so you would have to implement your own.

    Here is a pure Rcpp implementation that basically calls C code. If you want to make it proper C++, the worker function can be templated but for most applications using R that isn't a concern. Note that this also operates "in-place", which means that it modifies X without copying it.

    // restrict is not a standard C++ keyword, so uncomment whichever of the following matches your compiler
    //#define restrict __restrict__ // gcc/clang
    //#define restrict __restrict   // MS Visual Studio
    //#define restrict              // remove it completely
    
    #include <Rcpp.h>
    using namespace Rcpp;
    
    #include <cstddef>
    using std::size_t;
    
    void hadamardMultiplyMatrixByVectorInPlace(double* restrict x,
                                               size_t numRows, size_t numCols,
                                               const double* restrict y)
    {
      if (numRows == 0 || numCols == 0) return;
    
      for (size_t col = 0; col < numCols; ++col) {
        double* restrict x_col = x + col * numRows;
    
        for (size_t row = 0; row < numRows; ++row) {
          x_col[row] *= y[row];
        }
      }
    }
    
    // [[Rcpp::export]]
    NumericMatrix C_matvecprod_elwise_inplace(NumericMatrix& X,
                                              const NumericVector& y)
    {
      // y must supply exactly one multiplier per row of X
      if (y.size() != X.nrow()) stop("length of y must equal nrow(X)");
    
      hadamardMultiplyMatrixByVectorInPlace(X.begin(), X.nrow(), X.ncol(),
                                            y.begin());
    
      return X;
    }
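
    The remark about templating the worker can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: the names hadamardMultiplyInPlace and demoScaled are hypothetical, the restrict qualifiers are left out to keep the sketch portable, and the matrix is assumed to be stored column by column, as R stores it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical templated variant of the in-place worker.
// x is a column-major numRows x numCols buffer; y holds numRows multipliers.
template <typename T>
void hadamardMultiplyInPlace(T* x, std::size_t numRows, std::size_t numCols,
                             const T* y)
{
  for (std::size_t col = 0; col < numCols; ++col) {
    T* x_col = x + col * numRows;  // first element of this column
    for (std::size_t row = 0; row < numRows; ++row) {
      x_col[row] *= y[row];        // every column is scaled by the same y
    }
  }
}

// Demonstration helper: a 3 x 2 matrix stored column by column.
std::vector<double> demoScaled()
{
  std::vector<double> x = {1.0, 2.0, 3.0,   // column 0
                           4.0, 5.0, 6.0};  // column 1
  const std::vector<double> y = {10.0, 100.0, 1000.0};
  assert(x.size() == y.size() * 2);         // dimension check: 3 rows, 2 columns
  hadamardMultiplyInPlace(x.data(), y.size(), 2, y.data());
  return x;
}
```

    Calling demoScaled() scales row 1 by 10, row 2 by 100 and row 3 by 1000 in both columns. The same template would also accept float or int buffers, which is the main thing templating buys here.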
    

    Here is a version that makes a copy first. I don't know Rcpp well enough to do this natively and not incur a substantial performance hit. Creating and returning a NumericMatrix(numRows, numCols) on the stack causes the code to run about 30% slower.

    #include <Rcpp.h>
    using namespace Rcpp;
    
    #include <cstddef>
    using std::size_t;
    
    #include <R.h>
    #include <Rinternals.h>
    
    void hadamardMultiplyMatrixByVector(const double* restrict x,
                                        size_t numRows, size_t numCols,
                                        const double* restrict y,
                                        double* restrict z)
    {
      if (numRows == 0 || numCols == 0) return;
    
      for (size_t col = 0; col < numCols; ++col) {
        const double* restrict x_col = x + col * numRows;
        double* restrict z_col = z + col * numRows;
    
        for (size_t row = 0; row < numRows; ++row) {
          z_col[row] = x_col[row] * y[row];
        }
      }
    }
    
    // [[Rcpp::export]]
    SEXP C_matvecprod_elwise(const NumericMatrix& X, const NumericVector& y)
    {
      size_t numRows = X.nrow();
      size_t numCols = X.ncol();
    
      // y must supply exactly one multiplier per row of X
      if (y.size() != X.nrow()) stop("length of y must equal nrow(X)");
    
      SEXP Z = PROTECT(Rf_allocVector(REALSXP, (R_xlen_t) (numRows * numCols)));
      SEXP dimsExpr = PROTECT(Rf_allocVector(INTSXP, 2));
      int* dims = INTEGER(dimsExpr);
      dims[0] = (int) numRows;
      dims[1] = (int) numCols;
      Rf_setAttrib(Z, R_DimSymbol, dimsExpr);
    
      hadamardMultiplyMatrixByVector(X.begin(), X.nrow(), X.ncol(), y.begin(), REAL(Z));
    
      UNPROTECT(2);
    
      return Z;
    }
    

    If you're curious about the use of restrict: it is a contract between you, the programmer, and the compiler that the memory reached through different pointers does not overlap, which allows the compiler to make certain optimizations. The restrict keyword is part of C99 but not of any C++ standard; most C++ compilers offer it as an extension instead (for example __restrict__ in gcc/clang and __restrict in MS Visual Studio).
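
    To make the contract concrete, here is a small standalone sketch. The macro DEMO_RESTRICT and the functions scaleInto and demoRestrict are made up for this illustration; the macro simply selects the compiler extension when one is known and expands to nothing otherwise.

```cpp
#include <cassert>
#include <cstddef>

// Map the qualifier to a compiler extension; fall back to nothing.
#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_RESTRICT __restrict__
#elif defined(_MSC_VER)
#  define DEMO_RESTRICT __restrict
#else
#  define DEMO_RESTRICT
#endif

// The caller promises that dst and src never overlap, so the compiler
// is free to keep src values in registers or vectorize the loop.
void scaleInto(double* DEMO_RESTRICT dst, const double* DEMO_RESTRICT src,
               std::size_t n, double factor)
{
  assert(dst != src);  // cheap sanity check; partial overlap is not detected
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = src[i] * factor;
  }
}

double demoRestrict()
{
  const double src[4] = {1.0, 2.0, 3.0, 4.0};
  double dst[4] = {0.0, 0.0, 0.0, 0.0};
  scaleInto(dst, src, 4, 2.0);  // valid: two distinct arrays
  // scaleInto(dst + 1, dst, 3, 2.0) would break the promise,
  // and the result would no longer be guaranteed.
  return dst[0] + dst[1] + dst[2] + dst[3];
}
```

    The qualifier changes nothing about correct calls; it only removes the compiler's obligation to handle overlapping arguments, which is why keeping the promise is the caller's job.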

    Some R code to benchmark:

    require(rbenchmark)
    
    n <- 50000
    k <- 50
    X <- matrix(rnorm(n*k), nrow=n)
    e <- rnorm(n)
    
    R_matvecprod_elwise <- function(mat, vec) mat*vec
    
    all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise(X, e))
    X_dup <- X + 0
    all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise_inplace(X_dup, e))
    
    benchmark(R_matvecprod_elwise(X, e),
              C_matvecprod_elwise(X, e),
              C_matvecprod_elwise_inplace(X, e),
              columns = c("test", "replications", "elapsed", "relative"),
              order = "relative", replications = 1000)
    

    And the results:

                                   test replications elapsed relative
    3 C_matvecprod_elwise_inplace(X, e)         1000   3.317    1.000
    2         C_matvecprod_elwise(X, e)         1000   7.174    2.163
    1         R_matvecprod_elwise(X, e)         1000  10.670    3.217
    

    Finally, the in-place version may actually be even faster than the benchmark suggests: every replication multiplies into the same X, so over 1000 replications its entries overflow or underflow, and that mayhem can distort the timing.
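
    The effect of multiplying into the same storage over and over can be seen with a single value. The sketch below uses a hypothetical helper, repeatedScale, that is not part of the answer's code: a multiplier with magnitude below 1 drives the value through the subnormal range to zero, and a sufficiently large multiplier drives it to infinity. Arithmetic on subnormal values can be slower on many CPUs, which is one way the timing could be skewed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Multiply one value by the same factor `reps` times, mimicking what
// repeated in-place calls do to every entry of the matrix.
double repeatedScale(double start, double factor, std::size_t reps)
{
  assert(std::isfinite(start) && std::isfinite(factor));
  double value = start;
  for (std::size_t i = 0; i < reps; ++i) {
    value *= factor;  // the result overwrites the input each time
  }
  return value;
}
```

    For example, repeatedScale(1.0, 3.0, 1000) overflows to infinity and repeatedScale(1.0, 0.1, 1000) underflows to zero. Benchmarking the in-place function on a fresh copy of X per replication would avoid this, at the cost of timing the copy as well.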

    Edit:

    Removed the loop unrolling, as it provided no benefit and was otherwise distracting.
