问题
Does anyone know how to compute a correlation matrix from a very large sparse matrix in python? Basically, I am looking for something like numpy.corrcoef
that will work on a scipy sparse matrix.
回答1:
You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:
import numpy as np
from scipy import sparse
def sparse_corrcoef(A, B=None):
if B is not None:
A = sparse.vstack((A, B), format='csr')
A = A.astype(np.float64)
n = A.shape[1]
# Compute the covariance matrix
rowsum = A.sum(1)
centering = rowsum.dot(rowsum.T.conjugate()) / n
C = (A.dot(A.T.conjugate()) - centering) / (n - 1)
# The correlation coefficients are given by
# C_{i,j} / sqrt(C_{i} * C_{j})
d = np.diag(C)
coeffs = C / np.sqrt(np.outer(d, d))
return coeffs
Check that it works OK:
# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')
coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())
print(np.allclose(coeffs1, coeffs2))
# True
Be warned:
The amount of memory required for computing the covariance matrix C
will be heavily dependent on the sparsity structure of A
(and B
, if given). For example, if A
is an (m, n)
matrix containing just a single column of non-zero values then C
will be an (n, n)
matrix containing all non-zero values. If n
is large then this could be very bad news in terms of memory consumption.
回答2:
Just using numpy:
import numpy as np
C=((A.T*A -(sum(A).T*sum(A)/N))/(N-1)).todense()
V=np.sqrt(np.mat(np.diag(C)).T*np.mat(np.diag(C)))
COV = np.divide(C,V+1e-119)
来源:https://stackoverflow.com/questions/19231268/correlation-coefficients-for-sparse-matrix-in-python