I have a big performance problem in R. I wrote a function that iterates over a data.frame
object. It simply adds a new column to a data.frame
and a
As Ari mentioned at the end of his answer, the Rcpp
and inline
packages make it incredibly easy to make things fast. As an example, try this inline
code (warning: not tested):
body <- 'Rcpp::NumericMatrix nm(temp);
int nrtemp = Rccp::as(nrt);
for (int i = 0; i < nrtemp; ++i) {
temp(i, 9) = i
if (i > 1) {
if ((temp(i, 5) == temp(i - 1, 5) && temp(i, 2) == temp(i - 1, 2) {
temp(i, 9) = temp(i, 8) + temp(i - 1, 9)
} else {
temp(i, 9) = temp(i, 8)
}
} else {
temp(i, 9) = temp(i, 8)
}
return Rcpp::wrap(nm);
'
settings <- getPlugin("Rcpp")
# settings$env$PKG_CXXFLAGS <- paste("-I", getwd(), sep="") if you want to inc files in wd
dayloop <- cxxfunction(signature(nrt="numeric", temp="numeric"), body-body,
plugin="Rcpp", settings=settings, cppargs="-I/usr/include")
dayloop2 <- function(temp) {
# extract a numeric matrix from temp, put it in tmp
nc <- ncol(temp)
nm <- dayloop(nc, temp)
names(temp)[names(temp) == "V10"] <- "Kumm."
return(temp)
}
There's a similar procedure for #include
ing things, where you just pass a parameter
inc <- '#include
to cxxfunction, as include=inc
. What's really cool about this is that it does all of the linking and compilation for you, so prototyping is really fast.
Disclaimer: I'm not totally sure that the class of tmp should be numeric and not numeric matrix or something else. But I'm mostly sure.
Edit: if you still need more speed after this, OpenMP is a parallelization facility good for C++
. I haven't tried using it from inline
, but it should work. The idea would be to, in the case of n
cores, have loop iteration k
be carried out by k % n
. A suitable introduction is found in Matloff's The Art of R Programming, available here, in chapter 16, Resorting to C.