Running cor() (or any variant) over a sparse matrix in R

情话喂你 2020-12-13 14:16

When trying to run the cor() function on sparse matrices (of either type dgCMatrix or dgTMatrix), I get the following error:

Error in cor(x) : su         


        
3 answers
  •  旧巷少年郎
    2020-12-13 14:48

    The question was answered elegantly by @Ron, but a slight modification to that solution is a little cleaner and also returns the sample covariance matrix.

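    sparse.cor4 <- function(x){
        n <- nrow(x)
        cMeans <- colMeans(x)
        # sample covariance: (X'X - n * outer product of column means) / (n - 1)
        # only the p x p cross-product is made dense, never x itself
        covmat <- (as.matrix(crossprod(x)) - n*tcrossprod(cMeans))/(n-1)
        # column standard deviations from the diagonal of the covariance
        sdvec <- sqrt(diag(covmat))
        # correlation = covariance scaled by the outer product of the sds
        cormat <- covmat/tcrossprod(sdvec)
        list(cov=covmat,cor=cormat)
    }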
    

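    The simplification comes from this: take an n x p matrix X and an n x p matrix M whose every row holds the column means of X, and read E[.] loosely as division by n - 1 (the sample-covariance normalization):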

    cov(X) = E[(X-M)'(X-M)] = E[X'X - M'X - X'M + M'M] 
    
    M'X = X'M = M'M, which have (i,j) elements = sum(column i) * sum(column j) / n
    
    = n * mean(column i) * mean(column j)
    

    or written with a row vector m of the column means,

    = n * m'm
    

    Then cov(X) = E[X'X - n m'm], that is, (X'X - n m'm) / (n - 1) for the sample covariance,


    and the function is now a smidge faster.

    > X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
    > x <- Matrix(X,ncol=10)
    > system.time(corx <- sparse.cor(x))
       user  system elapsed 
      1.139   0.196   1.334 
    > system.time(corx3 <- sparse.cor3(x))
       user  system elapsed 
      0.194   0.007   0.201 
    > system.time(corx4 <- sparse.cor4(x))
       user  system elapsed 
      0.187   0.007   0.194 
    > system.time(correg <-cor(as.matrix(x)))
       user  system elapsed 
      0.341   0.067   0.407 
    > system.time(covreg <- cov(as.matrix(x)))
       user  system elapsed 
      0.314   0.016   0.330 
    > all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
    [1] TRUE
    > all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
    [1] TRUE
    > all.equal(c(as.matrix(corx4$cor)),c(as.matrix(correg)))
    [1] TRUE
    > all.equal(c(as.matrix(corx4$cov)),c(as.matrix(covreg)))
    [1] TRUE
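
    The identity behind sparse.cor4 can also be checked numerically on a small dense matrix. The following is only a sketch in base R: the toy data, the seed, and the variable names (cov_centered, cov_shortcut, identity_holds) are made up for illustration and are not part of the original answer.

    ```r
    # Hypothetical toy data: a small, mostly-zero numeric matrix.
    set.seed(1)
    n <- 50
    p <- 4
    X <- matrix(as.numeric(rbinom(n * p, size = 5, prob = 0.1)), nrow = n, ncol = p)

    # Column means as a plain vector of length p.
    m <- colMeans(X)

    # Centered form: (X - M)'(X - M) / (n - 1), where every row of M equals m.
    M <- matrix(m, nrow = n, ncol = p, byrow = TRUE)
    cov_centered <- crossprod(X - M) / (n - 1)

    # Uncentered form from the derivation: (X'X - n * outer(m, m)) / (n - 1).
    cov_shortcut <- (crossprod(X) - n * tcrossprod(m)) / (n - 1)

    # Both forms should agree with each other and with base R's cov().
    identity_holds <- isTRUE(all.equal(cov_centered, cov_shortcut)) &&
      isTRUE(all.equal(cov_shortcut, cov(X)))
    print(identity_holds)
    ```

    The point of the uncentered form for sparse input is that X - M is dense even when X is sparse, whereas crossprod(X) only ever touches the stored non-zero entries.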
    
