How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)

耶瑟儿~ 2020-12-13 17:27

TL;DR

What is the fastest method in R for reading and writing a subset of columns from a very large matrix? I attempt a solution with data

2 Answers
  • 2020-12-13 17:50

    Here's what I had in mind. This could probably be much sexier with Rcpp and friends, but I'm not as familiar with those tools.

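    #include <string.h>   /* memcpy */
    #include <R.h>
    #include <Rinternals.h>
    #include <Rdefines.h>

    /* Overwrite the columns of mat listed in loc (1-based) with the columns
       of matAdd. mat is modified in place, so nothing useful is returned. */
    SEXP addCol(SEXP mat, SEXP loc, SEXP matAdd)
    {
      int i, nr = nrows(mat), ncMat = ncols(mat);
      int nc = ncols(matAdd), ll = length(loc);
      if(ll != nc)
        error("length(loc) must equal ncol(matAdd)");
      if(TYPEOF(mat) != TYPEOF(matAdd))
        error("mat and matAdd must be the same type");
      if(nr != nrows(matAdd))
        error("mat and matAdd must have the same number of rows");
      if(TYPEOF(loc) != INTSXP)
        error("loc must be integer");
      int *iloc = INTEGER(loc);
      /* Reject out-of-range columns before writing anything.
         NA_integer_ is the smallest int, so the < 1 test catches it too. */
      for(i=0; i < ll; i++)
        if(iloc[i] < 1 || iloc[i] > ncMat)
          error("loc values must be between 1 and ncol(mat)");
    
      /* Matrices are stored column-major: column j starts at element (j-1)*nr.
         The size_t casts keep the offset from overflowing int on huge matrices. */
      switch(TYPEOF(mat)) {
        case REALSXP:
          for(i=0; i < ll; i++)
            memcpy(&(REAL(mat)[(size_t)(iloc[i]-1)*nr]),
                   &(REAL(matAdd)[(size_t)i*nr]), nr*sizeof(double));
          break;
        case INTSXP:
          for(i=0; i < ll; i++)
            memcpy(&(INTEGER(mat)[(size_t)(iloc[i]-1)*nr]),
                   &(INTEGER(matAdd)[(size_t)i*nr]), nr*sizeof(int));
          break;
        default:
          error("unsupported type");
      }
      return R_NilValue;
    }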
    

    Put the above function in addCol.c, then run R CMD SHLIB addCol.c. Then in R:

    addColC <- dyn.load("addCol.so")$addCol
    .Call(addColC, mat, Vsub, mat[,Vsub]+toinsert)
    

    The slight advantage to this approach over Roland's is that this only does the assignment. His function does the addition for you, which is faster, but also means you need a separate C/C++ function for every operation you need to do.
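The trade-off described here (one compiled function per operation versus a single assignment helper) can also be softened by passing the operation as an argument. The following is only a sketch in plain C, without the R API, so it can be compiled and tried on its own. The names `update_cols`, `col_op`, `COL_ASSIGN` and `COL_ADD` are made up for illustration and do not come from either answer; the only thing taken from the thread is the column-major layout, where one-based column `j` of a matrix with `nr` rows starts at element `(j - 1) * nr`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical operation selector; not part of either answer. */
typedef enum { COL_ASSIGN, COL_ADD } col_op;

/*
 * Update selected columns of a column-major matrix in place.
 *   mat: nr x nc doubles
 *   loc: ll one-based column indices into mat
 *   src: nr x ll doubles, column i goes to column loc[i] of mat
 * Returns 0 on success, -1 if any index is out of range (nothing is written).
 */
int update_cols(double *mat, int nr, int nc,
                const int *loc, int ll,
                const double *src, col_op op)
{
    /* Validate every index first so a bad one cannot cause a partial write. */
    for (int i = 0; i < ll; i++)
        if (loc[i] < 1 || loc[i] > nc)
            return -1;

    for (int i = 0; i < ll; i++) {
        /* Column j (one-based) starts at element (j - 1) * nr. */
        double *dst = mat + (size_t)(loc[i] - 1) * nr;
        const double *col = src + (size_t)i * nr;
        if (op == COL_ASSIGN) {
            memcpy(dst, col, (size_t)nr * sizeof(double));
        } else {
            for (int r = 0; r < nr; r++)
                dst[r] += col[r];
        }
    }
    return 0;
}
```

Wrapping something like this for `.Call` would still need the type and dimension checks from the answer above. The branch on `op` sits outside the inner loop, so the extra flexibility should cost very little per column.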

  • 2020-12-13 17:55

    Fun with Rcpp:

    You can use Eigen's Map class to modify an R object in place.

    library(RcppEigen)
    library(inline)
    
    incl <- '
    using  Eigen::Map;
    using  Eigen::MatrixXd;
    using  Eigen::VectorXi;
    typedef  Map<MatrixXd>  MapMatd;
    typedef  Map<VectorXi>  MapVeci;
    '
    
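    body <- '
    // Map the R objects without copying; A aliases the memory of AA
    MapMatd              A(as<MapMatd>(AA));
    const MapMatd        B(as<MapMatd>(BB));
    const MapVeci        ix(as<MapVeci>(ind));
    const int            mB(B.cols());
    // Add column i of B to the column of A given by the 1-based index ix[i]
    for (int i = 0; i < mB; ++i) 
    {
      A.col(ix.coeff(i)-1) += B.col(i);
    }
    '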
    
    funRcpp <- cxxfunction(signature(AA = "matrix", BB ="matrix", ind = "integer"), 
                           body, "RcppEigen", incl)
    
    set.seed(94253)
    K <- 100
    V <- 100000
    mat2 <-  mat <-  matrix(runif(K*V),nrow=K,ncol=V)
    
    Vsub <- sample(1:V, 20)
    toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
    mat[,Vsub] <- mat[,Vsub] + toinsert
    
    invisible(funRcpp(mat2, toinsert, Vsub))
    all.equal(mat, mat2)
    #[1] TRUE
    
    library(microbenchmark)
    microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
                   funRcpp(mat2, toinsert, Vsub))
    # Unit: microseconds
    #                                  expr    min     lq  median      uq       max neval
    # mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400   100
    #         funRcpp(mat2, toinsert, Vsub)  6.450  6.805  7.6605  7.9215    25.914   100
    

    I think this is basically what @Joshua Ulrich proposed. His warnings regarding breaking R's functional paradigm apply.

    I do the addition in C++, but it is trivial to change the function to only do assignment.

    Obviously, if you can implement your whole loop in Rcpp, you avoid repeated function calls at the R level and will gain performance.
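To illustrate the last point, here is a rough sketch of what "the whole loop in compiled code" could look like. It is plain C rather than Rcpp, so it can be tested in isolation. The struct `col_update` and the function `add_updates` are hypothetical names invented for this example, and the assumption is that all updates are known up front and can be handed over in one call.

```c
#include <stddef.h>

/* Hypothetical description of one update: which columns, and what to add. */
typedef struct {
    const int *loc;     /* ll one-based column indices */
    int ll;             /* number of columns in this update */
    const double *src;  /* nr x ll values, column-major */
} col_update;

/*
 * Apply nu updates to the nr x nc column-major matrix mat in a single call.
 * Returns 0 on success, -1 if any index is out of range (mat is untouched).
 */
int add_updates(double *mat, int nr, int nc, const col_update *u, int nu)
{
    /* First pass: validate everything so a failure leaves no partial result. */
    for (int k = 0; k < nu; k++)
        for (int i = 0; i < u[k].ll; i++)
            if (u[k].loc[i] < 1 || u[k].loc[i] > nc)
                return -1;

    /* Second pass: the loop that would otherwise run at the R level. */
    for (int k = 0; k < nu; k++) {
        for (int i = 0; i < u[k].ll; i++) {
            double *dst = mat + (size_t)(u[k].loc[i] - 1) * nr;
            const double *col = u[k].src + (size_t)i * nr;
            for (int r = 0; r < nr; r++)
                dst[r] += col[r];
        }
    }
    return 0;
}
```

With this shape the per-call overhead is paid once instead of once per update. Whether that matters in practice depends on how small each update is relative to the cost of a call from R.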
