When trying to run the cor() function on sparse matrices (of either type dgCMatrix or dgTMatrix), I get the following error:
Error in cor(x) : su
EDITED ANSWER - optimized for memory use and speed.
The error is expected: the cor function does not recognize a sparse matrix as a matrix, and there is not yet a method for correlations in the Matrix package.
I am not aware of an existing function that calculates this for you, but you can easily calculate it yourself using the matrix operators available in the Matrix package:
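sparse.cor <- function(x){
  n <- nrow(x)
  m <- ncol(x)
  ii <- unique(x@i)+1 # rows with at least one non-zero element (x@i is 0-based)
  Ex <- colMeans(x)   # column means over all n rows
  # center only the non-zero rows; this part is held as a dense vector
  nozero <- as.vector(x[ii,]) - rep(Ex,each=length(ii))
  # non-zero rows plus the contribution Ex %*% t(Ex) of each all-zero row
  covmat <- ( crossprod(matrix(nozero,ncol=m)) +
              crossprod(t(Ex))*(n-length(ii))
            )/(n-1)
  sdvec <- sqrt(diag(covmat)) # column standard deviations
  covmat/crossprod(t(sdvec))  # scale the covariances to correlations
}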
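Here covmat is your variance-covariance matrix, so you can return that one as well. The calculation selects the rows where at least one element is non-zero and centers them by subtracting the column means. To the cross product of these centered rows, you add the outer product of the column means multiplied by the number of all-zero rows, because an all-zero row centers to -E[X]. With the observations in the rows of X, this is equivalent to
( X - E[X] ) transposed times ( X - E[X] )
Divide by n-1 and you have your variance-covariance matrix. The correlation matrix then follows by dividing each entry by the product of the two corresponding standard deviations.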
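The split into non-zero rows and all-zero rows can be checked on a small matrix. The following is only a sketch: the matrix size, the seed, and the names term1, term2, cov_split and cov_dense are made up for illustration and are not part of the function above.

```r
library(Matrix)

set.seed(1)
# Small illustrative sparse matrix: 200 rows, 4 columns, mostly zeros
vals <- sample(0:3, 800, replace = TRUE, prob = c(0.91, 0.03, 0.03, 0.03))
x <- Matrix(vals, ncol = 4, sparse = TRUE)

n  <- nrow(x)
Ex <- colMeans(x)

# Rows containing at least one non-zero entry (x@i holds 0-based row indices)
ii <- unique(x@i) + 1

# Term 1: centered cross product over the non-zero rows only
centered <- sweep(as.matrix(x[ii, , drop = FALSE]), 2, Ex)
term1 <- crossprod(centered)

# Term 2: every all-zero row centers to -Ex, so it contributes Ex %*% t(Ex)
term2 <- tcrossprod(Ex) * (n - length(ii))

cov_split <- (term1 + term2) / (n - 1)
cov_dense <- cov(as.matrix(x))

# Compare the split computation with the dense reference
print(all.equal(cov_split, cov_dense, check.attributes = FALSE))
```

Only the rows listed in ii are ever converted to dense storage, which is where the memory saving comes from when most rows are entirely zero.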
A test case:
library(Matrix)
X <- sample(0:10,1e8,replace=TRUE,prob=c(0.99,rep(0.001,10)))
xx <- Matrix(X,ncol=5)
> system.time(out1 <- sparse.cor(xx))
user system elapsed
0.50 0.09 0.59
> system.time(out2 <- cor(as.matrix(xx)))
user system elapsed
1.75 0.28 2.05
> all.equal(out1,out2)
[1] TRUE
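If even the non-zero rows are too large to hold densely, a different route may help. This is a swapped-in technique, not the function above: it relies on the identity t(X - E[X]) %*% (X - E[X]) = t(X) %*% X - n * E[X] %*% t(E[X]), so only the m-by-m cross product is ever made dense. The name sparse_cor_alt is hypothetical, and the subtraction can lose precision when the column means are large relative to the standard deviations, so treat it as a sketch under those assumptions.

```r
library(Matrix)

# Hypothetical alternative: correlation from the sparse cross product
sparse_cor_alt <- function(x) {
  n  <- nrow(x)
  mu <- colMeans(x)
  # crossprod(x) is computed on the sparse matrix; the result is only m x m
  covmat <- (as.matrix(crossprod(x)) - n * tcrossprod(mu)) / (n - 1)
  sdvec <- sqrt(diag(covmat))
  covmat / tcrossprod(sdvec)
}

set.seed(42)
# Small illustrative input: 1000 rows, 5 columns, about 90% zeros
vals <- sample(0:10, 5000, replace = TRUE, prob = c(0.9, rep(0.01, 10)))
x <- Matrix(vals, ncol = 5, sparse = TRUE)

# Compare against the dense reference
print(all.equal(sparse_cor_alt(x), cor(as.matrix(x)), check.attributes = FALSE))
```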