When trying to run the cor() function on sparse matrices (of either type dgCMatrix or dgTMatrix) I get the following error:
Error in cor(x) : su
The question was answered elegantly by @Ron, but a slight modification to that solution is a little cleaner and also returns the sample covariance matrix.
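sparse.cor4 <- function(x){
  n <- nrow(x)
  cMeans <- colMeans(x)
  # sample covariance: (X'X - n * m'm) / (n - 1); only the p x p result is made dense
  covmat <- (as.matrix(crossprod(x)) - n*tcrossprod(cMeans))/(n-1)
  sdvec <- sqrt(diag(covmat))
  # divide each covariance by the product of the two standard deviations
  cormat <- covmat/tcrossprod(sdvec)
  list(cov=covmat,cor=cormat)
}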
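As a quick sanity check, here is a minimal sketch that runs the function on a small sparse matrix and compares it with the dense cov() and cor(). The test matrix `small` and the flags `cov_ok` and `cor_ok` are made up for illustration, and the function is repeated only so the snippet runs on its own.

```r
library(Matrix)

# Same function as above, repeated so this snippet is self-contained.
sparse.cor4 <- function(x) {
  n <- nrow(x)
  cMeans <- colMeans(x)
  covmat <- (as.matrix(crossprod(x)) - n * tcrossprod(cMeans)) / (n - 1)
  sdvec <- sqrt(diag(covmat))
  cormat <- covmat / tcrossprod(sdvec)
  list(cov = covmat, cor = cormat)
}

# Hypothetical toy data: 50 x 4, mostly zeros, stored as a sparse matrix.
set.seed(1)
small <- Matrix(rpois(200, 0.3), ncol = 4, sparse = TRUE)

res   <- sparse.cor4(small)
dense <- as.matrix(small)

# Compare against the dense reference implementations.
cov_ok <- isTRUE(all.equal(res$cov, cov(dense), check.attributes = FALSE))
cor_ok <- isTRUE(all.equal(res$cor, cor(dense), check.attributes = FALSE))
```

Note that a column with zero variance would give a zero in sdvec and therefore NaN entries in the correlation matrix, so it may be worth dropping constant columns first.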
The simplification comes from this: with an n x p matrix X and an n x p matrix M whose every row holds the column means of X (writing E[.] loosely for division by n-1):
cov(X) = E[(X-M)'(X-M)] = E[X'X - M'X - X'M + M'M]
M'X = X'M = M'M, each of which has (i,j) element
= sum(column i) * sum(column j) / n
= n * mean(column i) * mean(column j)
or, written with a 1 x p row vector m of the column means,
= n * m'm
Then cov(X) = E[X'X - n m'm - n m'm + n m'm] = E[X'X - n m'm]
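The identity can also be checked numerically. This is only a sketch on a small dense matrix (the identity does not depend on sparsity); the toy data and the names `nmm`, `cross_ok` and `cov_ok` are made up for the illustration.

```r
# Toy data: n = 6 rows, p = 5 columns.
set.seed(42)
X <- matrix(rnorm(30), nrow = 6)
n <- nrow(X)

m <- matrix(colMeans(X), nrow = 1)   # 1 x p row vector of column means
M <- matrix(1, n, 1) %*% m           # n x p matrix, every row equal to m

nmm <- n * crossprod(m)              # n * m'm, a p x p matrix

# M'X and M'M should both equal n * m'm.
cross_ok <- isTRUE(all.equal(crossprod(M, X), nmm)) &&
            isTRUE(all.equal(crossprod(M), nmm))

# (X'X - n m'm) / (n - 1) should match the sample covariance.
cov_ok <- isTRUE(all.equal((crossprod(X) - nmm) / (n - 1), cov(X)))
```

Because only X'X and the column means are needed, X itself is never centered, which is what keeps the sparse matrix sparse.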
It is also a smidge faster:
> X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> system.time(corx <- sparse.cor(x))
user system elapsed
1.139 0.196 1.334
> system.time(corx3 <- sparse.cor3(x))
user system elapsed
0.194 0.007 0.201
> system.time(corx4 <- sparse.cor4(x))
user system elapsed
0.187 0.007 0.194
> system.time(correg <-cor(as.matrix(x)))
user system elapsed
0.341 0.067 0.407
> system.time(covreg <- cov(as.matrix(x)))
user system elapsed
0.314 0.016 0.330
> all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx4$cor)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx4$cov)),c(as.matrix(covreg)))
[1] TRUE