Running cor() (or any variant) over a sparse matrix in R

后端 未结 3 887
情话喂你
情话喂你 2020-12-13 14:16

When trying to run the cor() function on sparse matrices (of either type dgCMatrix or dgTMatrix) I get the following error:

Error in cor(x) : su         


        
相关标签:
3条回答
  • 2020-12-13 14:48

    The answer was solved elegantly by @Ron but a slight modification to the solution is a little cleaner and also returns the sample covariance matrix.

    sparse.cor4 <- function(x){
        n <- nrow(x)
        cMeans <- colMeans(x)
        covmat <- (as.matrix(crossprod(x)) - n*tcrossprod(cMeans))/(n-1)
        sdvec <- sqrt(diag(covmat)) 
        cormat <- covmat/tcrossprod(sdvec)
        list(cov=covmat,cor=cormat)
    }
    

    The simplification comes from this: with an n x p matrix X, and an n x p matrix M of the column means of X:

    cov(X) = E[(X-M)'(X-M)] = E[X'X - M'X - X'M + M'M] 
    
    M'X = X'M = M'M, which have (i,j) elements = sum(column i) * sum(column j) / n
    
    = n * mean(column i) * mean(column j)
    

    or written with a row vector m of the column means,

    = n * m'm
    

    Then cov(X) = E[X'X - n m'm]


    and it is now a smidge faster.

    > X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
    > x <- Matrix(X,ncol=10)
    > system.time(corx <- sparse.cor(x))
       user  system elapsed 
      1.139   0.196   1.334 
    > system.time(corx3 <- sparse.cor3(x))
       user  system elapsed 
      0.194   0.007   0.201 
    > system.time(corx4 <- sparse.cor4(x))
       user  system elapsed 
      0.187   0.007   0.194 
    > system.time(correg <-cor(as.matrix(x)))
       user  system elapsed 
      0.341   0.067   0.407 
    > system.time(covreg <- cov(as.matrix(x)))
       user  system elapsed 
      0.314   0.016   0.330 
    > all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
    [1] TRUE
    > all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
    [1] TRUE
    > all.equal(c(as.matrix(corx4$cor)),c(as.matrix(correg)))
    [1] TRUE
    > all.equal(c(as.matrix(corx4$cov)),c(as.matrix(covreg)))
    [1] TRUE
    
    0 讨论(0)
  • 2020-12-13 14:53

    EDITED ANSWER - optimized for memory use and speed.

    Your error is logic, as a sparse matrix is not recognized by the cor function as a matrix, and there is -yet- no method for correlations in the Matrix package.

    There is no function I am aware of that will let you calculate this, but you can easily calculate that yourself, using the matrix operators that are available in the Matrix package :

    sparse.cor <- function(x){
      n <- nrow(x)
      m <- ncol(x)
      ii <- unique(x@i)+1 # rows with a non-zero element
    
      Ex <- colMeans(x)
      nozero <- as.vector(x[ii,]) - rep(Ex,each=length(ii))        # colmeans
    
      covmat <- ( crossprod(matrix(nozero,ncol=m)) +
                  crossprod(t(Ex))*(n-length(ii))
                )/(n-1)
      sdvec <- sqrt(diag(covmat))
      covmat/crossprod(t(sdvec))
    }
    

    the covmat is your variance-covariance matrix, so you can calculate that one as well. The calculation is based on selecting the rows where at least one element is non-zero. to the cross product of this one, you add the colmeans multiplied by the number of all-zero rows. This is equivalent to

    ( X - E[X] ) times ( X - E[X] ) transposed

    Divide by n-1 and you have your variance-covariance matrix. The rest is easy.

    A test case :

    X <- sample(0:10,1e8,replace=T,p=c(0.99,rep(0.001,10)))
    xx <- Matrix(X,ncol=5)
    
    > system.time(out1 <- sparse.cor(xx))
       user  system elapsed 
       0.50    0.09    0.59 
    > system.time(out2 <- cor(as.matrix(xx)))
       user  system elapsed 
       1.75    0.28    2.05 
    > all.equal(out1,out2)
    [1] TRUE
    
    0 讨论(0)
  • 2020-12-13 15:04

    This is what I ended up using. Thanks @Joris for all the help!

    My x matrix is quite big. Assuming it's size is n * p, n=200k and p=10k in my case.

    The trick is to maintain the sparsity of the operations and perform the calculations on p * p matrices instead of p*n x n*p.

    Version 1, is more straightforward, but less efficient on time and memory, as the outer product operation is expensive:

    sparse.cor2 <- function(x){
        n <- nrow(x)
    
        covmat <- (crossprod(x)-2*(colMeans(x) %o% colSums(x))
                  +n*colMeans(x)%o%colMeans(x))/(n-1)
    
        sdvec <- sqrt(diag(covmat)) # standard deviations of columns
        covmat/crossprod(t(sdvec)) # correlation matrix
    }
    

    Version 2 is more efficient on time (saves several operations) and on memory. Still requires huge amounts of memory for a p=10k matrix:

    sparse.cor3 <- function(x){
        memory.limit(size=10000)
        n <- nrow(x)
    
        cMeans <- colMeans(x)
        cSums <- colSums(x)
    
        # Calculate the population covariance matrix.
        # There's no need to divide by (n-1) as the std. dev is also calculated the same way.
        # The code is optimized to minize use of memory and expensive operations
        covmat <- tcrossprod(cMeans, (-2*cSums+n*cMeans))
        crossp <- as.matrix(crossprod(x))
        covmat <- covmat+crossp
    
        sdvec <- sqrt(diag(covmat)) # standard deviations of columns
        covmat/crossprod(t(sdvec)) # correlation matrix
    }
    

    Timing comparisons (sparse.cor is @Joris latest version):

    > X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
    > x <- Matrix(X,ncol=10)
    > 
    > object.size(x)
    11999472 bytes
    > 
    > system.time(corx <- sparse.cor(x))
       user  system elapsed 
       0.50    0.06    0.56 
    > system.time(corx2 <- sparse.cor2(x))
       user  system elapsed 
       0.17    0.00    0.17 
    > system.time(corx3 <- sparse.cor3(x))
       user  system elapsed 
       0.13    0.00    0.12 
    > system.time(correg <-cor(as.matrix(x)))
       user  system elapsed 
       0.25    0.03    0.29 
    > all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
    [1] TRUE
    > all.equal(c(as.matrix(corx2)),c(as.matrix(correg)))
    [1] TRUE
    > all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
    [1] TRUE
    

    Much larger x matrix:

    > X <- sample(0:10,1e8,replace=T,p=c(0.9,rep(0.01,10)))
    > x <- Matrix(X,ncol=10)
    > object.size(x)
    120005688 bytes
    > system.time(corx2 <- sparse.cor2(x))
       user  system elapsed 
       1.47    0.07    1.53 
    > system.time(corx3 <- sparse.cor3(x))
       user  system elapsed 
       1.18    0.09    1.29 
    > system.time(corx <- sparse.cor(x))
       user  system elapsed 
       5.43    1.26    6.71
    
    0 讨论(0)
提交回复
热议问题