Running cor() (or any variant) over a sparse matrix in R

情话喂你 2020-12-13 14:16

When trying to run the cor() function on sparse matrices (of either type dgCMatrix or dgTMatrix) I get the following error:

Error in cor(x) : su         


        
3 answers
  •  轮回少年
    2020-12-13 14:53

    EDITED ANSWER - optimized for memory use and speed.

    The error is to be expected: the cor function does not recognize a sparse matrix as a matrix, and there is not yet a method for correlations in the Matrix package.
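    To see why cor() rejects the object, a small check like the following may help. It is only a sketch: the matrix m is made up for illustration, and it assumes the Matrix package is installed.

```r
library(Matrix)

# Made-up 4 x 3 sparse matrix, for illustration only
m <- Matrix(c(0, 1, 0, 0,
              2, 0, 0, 0,
              3, 0, 0, 4), ncol = 3, sparse = TRUE)

class(m)       # an S4 class from the Matrix package, not a base matrix
is.matrix(m)   # the test that base functions such as cor() rely on

# Converting to a dense base matrix works, at the cost of memory
dense.cor <- cor(as.matrix(m))
```

    For small inputs the as.matrix() conversion is the simplest fix; it only becomes a problem when the dense copy does not fit comfortably in memory.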

    I am not aware of a function that computes this directly, but you can easily calculate it yourself using the matrix operators that are available in the Matrix package:

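    sparse.cor <- function(x){
      n <- nrow(x)
      m <- ncol(x)
      ii <- unique(x@i)+1 # rows with a non-zero element (x@i is 0-based)
    
      Ex <- colMeans(x)   # column means
      # densify only the non-zero rows and center them by the column means
      nozero <- as.vector(x[ii,]) - rep(Ex,each=length(ii))
    
      # cross product of the centered non-zero rows, plus the contribution
      # of the all-zero rows: each one adds the outer product of Ex with itself
      covmat <- ( crossprod(matrix(nozero,ncol=m)) +
                  crossprod(t(Ex))*(n-length(ii))
                )/(n-1)
      sdvec <- sqrt(diag(covmat))   # column standard deviations
      covmat/crossprod(t(sdvec))    # scale covariances to correlations
    }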
    

    Here covmat is the variance-covariance matrix, so the same code gives you that as well. The calculation selects the rows where at least one element is non-zero, centers them by the column means, and takes their cross product. To this you add the outer product of the column means multiplied by the number of all-zero rows, since each all-zero row, once centered, is just minus the column means. This is equivalent to

    ( X - E[X] ) transposed times ( X - E[X] )

    Divide by n-1 and you have your variance-covariance matrix. The rest is easy.
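    As a quick sanity check of this decomposition, here is a minimal sketch in base R on a small dense matrix. The matrix X and the names nz and Xc are made up for illustration and are not part of the function above.

```r
# Small dense example: rows 2 and 4 are all zero.
X <- rbind(c(1, 0, 2),
           c(0, 0, 0),
           c(0, 3, 0),
           c(0, 0, 0),
           c(4, 0, 1))

n  <- nrow(X)
Ex <- colMeans(X)

# Rows with at least one non-zero entry
nz <- rowSums(X != 0) > 0

# Center only the non-zero rows by the column means
Xc <- sweep(X[nz, , drop = FALSE], 2, Ex)

# Non-zero rows contribute crossprod(Xc); each all-zero row contributes
# the outer product of Ex with itself
covmat <- (crossprod(Xc) + tcrossprod(Ex) * (n - sum(nz))) / (n - 1)

same <- isTRUE(all.equal(covmat, cov(X)))
```

    If the decomposition holds, same is TRUE, because the two terms together are exactly the full centered cross product.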

    A test case:

    library(Matrix)
    X <- sample(0:10,1e8,replace=TRUE,prob=c(0.99,rep(0.001,10)))
    xx <- Matrix(X,ncol=5)
    
    > system.time(out1 <- sparse.cor(xx))
       user  system elapsed 
       0.50    0.09    0.59 
    > system.time(out2 <- cor(as.matrix(xx)))
       user  system elapsed 
       1.75    0.28    2.05 
    > all.equal(out1,out2)
    [1] TRUE
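
    The function above turns every row that has a non-zero element into dense storage, which may still be expensive when few rows are entirely zero. A possible alternative, sketched below, keeps the cross product sparse by using the identity that the centered cross product equals X'X minus n times the outer product of the column means. Note that sparse.cor2 is a hypothetical name, this is a different technique from the one in the answer, and the subtraction can lose precision when the column means are large relative to the standard deviations.

```r
library(Matrix)

# Hypothetical helper (not part of the Matrix package): column correlations
# of a sparse matrix without densifying any of its rows.
sparse.cor2 <- function(x) {
  n  <- nrow(x)
  Ex <- colMeans(x)
  # Centered cross product expands to x'x - n * Ex Ex'
  covmat <- (as.matrix(crossprod(x)) - n * tcrossprod(Ex)) / (n - 1)
  sdvec  <- sqrt(diag(covmat))
  covmat / tcrossprod(sdvec)
}

# Small made-up test matrix, roughly 90% zeros
set.seed(1)
small <- Matrix(sample(0:10, 5000, replace = TRUE,
                       prob = c(0.9, rep(0.01, 10))),
                ncol = 5, sparse = TRUE)

ok <- isTRUE(all.equal(sparse.cor2(small), cor(as.matrix(small)),
                       check.attributes = FALSE))
```

    Only the final m-by-m cross product is converted to a dense matrix here, so memory use depends on the number of columns rather than the number of rows.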
    
