Question
Summing over one axis of a scipy.sparse.csr_matrix results in a numpy.matrix object. Given that my sparse matrix is really sparse, I find this behaviour extremely annoying.
Here is an example:
from scipy.sparse import csr_matrix

dense = [[ 0., 0., 0., 0., 0.],
         [ 1., 0., 0., 0., 0.],
         [ 0., 0., 0., 0., 0.],
         [ 0., 0., 0., 0., 0.],
         [ 2., 0., 4., 0., 0.]]
sparse = csr_matrix(dense)
print(sparse.sum(1))
with result:
matrix([[ 0.],
        [ 1.],
        [ 0.],
        [ 0.],
        [ 6.]])
How can I enforce sparseness in the sum-over-columns operation, without implicitly converting the matrix to a dense format?
In this example I've just used a small matrix, but my real matrix is far larger and sparser, so it's a large waste of space to pass through a dense representation.
Answer 1:
scipy.sparse performs the sum with a matrix multiplication (here M is the csr_matrix built from the question's example):
In [136]: np.matrix(np.ones(M.shape[1]))@M
Out[136]: matrix([[3., 0., 4., 0., 0.]])
In [137]: M@np.matrix(np.ones((M.shape[1],1)))
Out[137]:
matrix([[0.],
[1.],
[0.],
[0.],
[6.]])
In [138]: timeit M@np.matrix(np.ones((M.shape[1],1)))
91.5 µs ± 268 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [139]: timeit M.sum(1)
96.6 µs ± 647 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The times are similar. Both produce an np.matrix result.
If I multiply with a 2d array instead, I get an array result and, somewhat surprisingly, a much better time:
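As an aside: if a plain 1d ndarray is all you want from the sum, a common idiom (not part of the original answer) is to wrap the np.matrix result in np.asarray and flatten it:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = [[0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [2., 0., 4., 0., 0.]]
M = csr_matrix(dense)

# M.sum(1) returns a (5, 1) np.matrix; asarray + ravel turns it
# into an ordinary 1d ndarray.
row_sums = np.asarray(M.sum(1)).ravel()
print(row_sums)  # [0. 1. 0. 0. 6.]
```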
In [140]: timeit M@np.ones((M.shape[1],1))
24.4 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [141]: M@np.ones((M.shape[1],1))
Out[141]:
array([[0.],
[1.],
[0.],
[0.],
[6.]])
I could put that array back into a sparse matrix - but at a time cost:
In [142]: csr_matrix(M@np.ones((M.shape[1],1)))
Out[142]:
<5x1 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
In [143]: timeit csr_matrix(M@np.ones((M.shape[1],1)))
391 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Or we could create a sparse matrix first:
In [144]: M@csr_matrix(np.ones((M.shape[1],1)))
Out[144]:
<5x1 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
In [145]: timeit M@csr_matrix(np.ones((M.shape[1],1)))
585 µs ± 5.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even taking the extractor matrix creation out of the timed loop still leaves it slower:
In [146]: %%timeit m1 = csr_matrix(np.ones((M.shape[1],1)))
...: M@m1
227 µs ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A sum like this (nearly) always increases the density of the result. Matrices with at least one nonzero value per row are more common than ones with many all-zero rows. Timings in your real-world case might be different, but trying to save memory might not buy you that much.
If I look in more detail at the csr matrix produced by the sparse matrix multiplication:
In [147]: res = M@csr_matrix(np.ones((M.shape[1],1)))
In [148]: res
Out[148]:
<5x1 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
In [149]: res.indptr
Out[149]: array([0, 0, 1, 1, 1, 2], dtype=int32)
In [150]: res.indices
Out[150]: array([0, 0], dtype=int32)
The indptr array has one value per row (+1), so the memory use of this column matrix is actually higher than the dense equivalent. The same res in csc format would be more compact, with a 2-element indptr.
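A quick sketch to check that, rebuilding M from the question's example and comparing the indptr lengths of the same column vector in csr and csc form:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = [[0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [2., 0., 4., 0., 0.]]
M = csr_matrix(dense)
res = M @ csr_matrix(np.ones((M.shape[1], 1)))

# csr: indptr has nrows + 1 entries, even though only 2 values are stored.
print(res.indptr)           # [0 0 1 1 1 2]
# csc: indptr has ncols + 1 entries -- just 2 for a column vector.
print(res.tocsc().indptr)   # [0 2]
```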
It is also possible to work directly with the indptr, indices, and data attributes of the csr matrix, essentially iterating over the rows, summing each, and building a new sparse matrix from the result. In some cases we've achieved speed improvements over the sparse methods that way. But you have to understand the data storage, and be smart about the whole thing.
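A minimal sketch of that idea, assuming the goal is a sparse column vector of row sums. The Python-level loop over indptr slices is just one way to do it (np.add.reduceat would be faster but needs care with empty rows); the variable names here are my own:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = [[0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [2., 0., 4., 0., 0.]]
M = csr_matrix(dense)

# data holds the nonzero values; indptr[i]:indptr[i+1] is row i's
# slice of data, so summing each slice gives the row sums.
row_sums = np.array([M.data[M.indptr[i]:M.indptr[i + 1]].sum()
                     for i in range(M.shape[0])])

# Keep only the nonzero rows and build a sparse column vector directly,
# using the (data, (row, col)) COO-style constructor.
nz = np.flatnonzero(row_sums)
res = csr_matrix((row_sums[nz], (nz, np.zeros_like(nz))),
                 shape=(M.shape[0], 1))
print(res.nnz)  # 2 -- only the two nonzero rows are stored
```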
Source: https://stackoverflow.com/questions/62217382/scipy-sparse-matrix-sum-results-in-a-dense-matrix-how-to-enforce-result-sparse