I have two matrices that I want to apply a function to, by rows:
matrixA
GSM83009 GSM83037 GSM83002 GSM83029 GSM83041
100001_at 5.873321 5.4
Splitting the matrices isn't the biggest contributor to evaluation time.
set.seed(21)
matrixA <- matrix(rnorm(5 * 9000), nrow = 9000)
matrixB <- matrix(rnorm(4 * 9000), nrow = 9000)
system.time( scores <- mapply(t.test.stat,
split(matrixA, row(matrixA)), split(matrixB, row(matrixB))) )
# user system elapsed
# 1.57 0.00 1.58
smA <- split(matrixA, row(matrixA))
smB <- split(matrixB, row(matrixB))
system.time( scores <- mapply(t.test.stat, smA, smB) )
# user system elapsed
# 1.14 0.00 1.14
Look at the output from Rprof
to see that most of the time is--not surprisingly--spent evaluating t.test.stat
(mean
, var
, etc.). Basically, there's quite a bit of overhead from function calls.
Rprof()
scores <- mapply(t.test.stat, smA, smB)
Rprof(NULL)
summaryRprof()
You may be able to find faster generalized solutions, but none will approach the speed of the vectorized solution below.
Since your function is simple, you can take advantage of the vectorized rowMeans
function to do this almost instantaneously (though it's a bit messy):
system.time({
ncA <- NCOL(matrixA)
ncB <- NCOL(matrixB)
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
sqrt( rowMeans((matrixA-rowMeans(matrixA))^2)*(ncA/(ncA-1))/ncA +
rowMeans((matrixB-rowMeans(matrixB))^2)*(ncB/(ncB-1))/ncB )
})
# user system elapsed
# 0 0 0
head(ans)
# [1] 0.8272511 -1.0965269 0.9862844 -0.6026452 -0.2477661 1.1896181
UPDATE
Here's a "cleaner" version using a rowVars
function:
rowVars <- function(x, na.rm=FALSE, dims=1L) {
rowMeans((x-rowMeans(x, na.rm, dims))^2, na.rm, dims)*(NCOL(x)/(NCOL(x)-1))
}
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
sqrt( rowVars(matrixA)/NCOL(matrixA) + rowVars(matrixB)/NCOL(matrixB) )
This solution avoids splitting, and lists, so maybe it will be faster than your version:
## original data:
tmp1 <- matrix(sample(1:100, 20), nrow = 5)
tmp2 <- matrix(sample(1:100, 20), nrow = 5)
## combine them together
tmp3 <- cbind(tmp1, tmp2)
## calculate t.stats:
t.stats <- apply(tmp3, 1, function(x) t.test(x[1:ncol(tmp1)],
x[(1 + ncol(tmp1)):ncol(tmp3)])$statistic)
Edit: Just tested it on two matrices of 9000 rows and 5 columns each, and it completed in less than 6 seconds:
tmp1 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp2 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp3 <- cbind(tmp1, tmp2)
system.time(t.st <- apply(tmp3, 1, function(x) t.test(x[1:5], x[6:10])$statistic))
-> user system elapsed
-> 5.640 0.012 5.705