What is the fastest method in R for reading and writing a subset of columns from a very large matrix. I attempt a solution with data
Here's what I had in mind. This could probably be much sexier with Rcpp and friends, but I'm not as familiar with those tools.
#include <R.h>
#include <Rinternals.h>
#include <Rdefines.h>
SEXP addCol(SEXP mat, SEXP loc, SEXP matAdd)
{
int i, nr = nrows(mat), nc = ncols(matAdd), ll = length(loc);
if(ll != nc)
error("length(loc) must equal ncol(matAdd)");
if(TYPEOF(mat) != TYPEOF(matAdd))
error("mat and matAdd must be the same type");
if(nr != nrows(matAdd))
error("mat and matAdd must have the same number of rows");
if(TYPEOF(loc) != INTSXP)
error("loc must be integer");
int *iloc = INTEGER(loc);
switch(TYPEOF(mat)) {
case REALSXP:
for(i=0; i < ll; i++)
memcpy(&(REAL(mat)[(iloc[i]-1)*nr]),
&(REAL(matAdd)[i*nr]), nr*sizeof(double));
break;
case INTSXP:
for(i=0; i < ll; i++)
memcpy(&(INTEGER(mat)[(iloc[i]-1)*nr]),
&(INTEGER(matAdd)[i*nr]), nr*sizeof(int));
break;
default:
error("unsupported type");
}
return R_NilValue;
}
Put the above function in addCol.c
, then run R CMD SHLIB addCol.c. Then in R:
addColC <- dyn.load("addCol.so")$addCol
.Call(addColC, mat, Vsub, mat[,Vsub]+toinsert)
The slight advantage to this approach over Roland's is that this only does the assignment. His function does the addition for you, which is faster, but also means you need a separate C/C++ function for every operation you need to do.
Fun with Rcpp:
You can use Eigen's Map class to modify an R object in place.
library(RcppEigen)
library(inline)
incl <- '
using Eigen::Map;
using Eigen::MatrixXd;
using Eigen::VectorXi;
typedef Map<MatrixXd> MapMatd;
typedef Map<VectorXi> MapVeci;
'
body <- '
MapMatd A(as<MapMatd>(AA));
const MapMatd B(as<MapMatd>(BB));
const MapVeci ix(as<MapVeci>(ind));
const int mB(B.cols());
for (int i = 0; i < mB; ++i)
{
A.col(ix.coeff(i)-1) += B.col(i);
}
'
funRcpp <- cxxfunction(signature(AA = "matrix", BB ="matrix", ind = "integer"),
body, "RcppEigen", incl)
set.seed(94253)
K <- 100
V <- 100000
mat2 <- mat <- matrix(runif(K*V),nrow=K,ncol=V)
Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert
invisible(funRcpp(mat2, toinsert, Vsub))
all.equal(mat, mat2)
#[1] TRUE
library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
funRcpp(mat2, toinsert, Vsub))
# Unit: microseconds
# expr min lq median uq max neval
# mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400 100
# funRcpp(mat2, toinsert, Vsub) 6.450 6.805 7.6605 7.9215 25.914 100
I think this is basically what @Joshua Ulrich proposed. His warnings regarding breaking R's functional paradigm apply.
I do the addition in C++, but it is trivial to change the function to only do assignment.
Obviously, if you can implement your whole loop in Rcpp, you avoid repeated function calls at the R level and will gain performance.