scipy sparse matrix sum results in a dense matrix - how to enforce result sparseness?

Submitted by 北城余情 on 2020-07-09 12:50:07

Question


Summing over one axis of a scipy.sparse.csr_matrix results in a numpy.matrix object. Given that my sparse matrix is really sparse, I find this behaviour extremely annoying.

Here is an example:

dense = [[ 0.,  0.,  0.,  0.,  0.],
         [ 1.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.],
         [ 2.,  0.,  4.,  0.,  0.]]


from scipy.sparse import csr_matrix
sparse = csr_matrix(dense)

print(sparse.sum(1))

with result:

matrix([[ 0.],
        [ 1.],
        [ 0.],
        [ 0.],
        [ 6.]])

How can I enforce sparseness in the sum-over-columns operation without implicitly converting the matrix to dense format? In this example I've just used a small matrix, but my real matrix is far larger and sparser, so it's a large waste of memory to pass through a dense representation.


Answer 1:


scipy.sparse performs the sum with a matrix multiplication (M below is the csr_matrix from the question):

In [136]: np.matrix(np.ones(M.shape[1]))@M                                      
Out[136]: matrix([[3., 0., 4., 0., 0.]])
In [137]: M@np.matrix(np.ones((M.shape[1],1)))                                  
Out[137]: 
matrix([[0.],
        [1.],
        [0.],
        [0.],
        [6.]])
In [138]: timeit M@np.matrix(np.ones((M.shape[1],1)))                           
91.5 µs ± 268 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [139]: timeit M.sum(1)                                                       
96.6 µs ± 647 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The times are similar. Both produce an np.matrix result.

If I multiply with a 2d array instead, I get an array result and, somewhat surprisingly, a much better time:

In [140]: timeit M@np.ones((M.shape[1],1))                                      
24.4 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [141]: M@np.ones((M.shape[1],1))                                             
Out[141]: 
array([[0.],
       [1.],
       [0.],
       [0.],
       [6.]])

I could put that array back into a sparse matrix - but at a time cost:

In [142]: csr_matrix(M@np.ones((M.shape[1],1)))                                 
Out[142]: 
<5x1 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>
In [143]: timeit csr_matrix(M@np.ones((M.shape[1],1)))                          
391 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Or we could create a sparse matrix first:

In [144]: M@csr_matrix(np.ones((M.shape[1],1)))                                 
Out[144]: 
<5x1 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>
In [145]: timeit M@csr_matrix(np.ones((M.shape[1],1)))                          
585 µs ± 5.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even with the extractor matrix creation moved out of the timing loop, it is still slower:

In [146]: %%timeit m1 = csr_matrix(np.ones((M.shape[1],1))) 
     ...: M@m1                                                                     
227 µs ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A sum like this (nearly) always increases the density of the result. Matrices with at least one nonzero value per row are more common than ones with many all-zero rows. Timings in your real-world case might be different, but trying to save memory might not buy you that much.
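To make that concrete, here is a rough sketch with an assumed 1000x1000 random sparse matrix at 1% density (not your data); with about 10 nonzeros per row, essentially every row sum is nonzero, so the summed column is nearly dense even though the matrix itself is very sparse:

import scipy.sparse as sp

# assumed example matrix: 1000x1000 CSR, ~1% of entries nonzero
R = sp.random(1000, 1000, density=0.01, format='csr', random_state=0)

row_sums = R.sum(axis=1)        # np.matrix of shape (1000, 1)
print((row_sums != 0).sum())    # typically very close to 1000 - hardly any all-zero rows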

If I look in more detail at the csr matrix produced by the sparse matrix multiplication:

In [147]: res = M@csr_matrix(np.ones((M.shape[1],1)))                           
In [148]: res                                                                   
Out[148]: 
<5x1 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>
In [149]: res.indptr                                                            
Out[149]: array([0, 0, 1, 1, 1, 2], dtype=int32)
In [150]: res.indices                                                           
Out[150]: array([0, 0], dtype=int32)

The indptr array has one value per row (plus one), so the memory use of this column matrix is actually higher than the dense equivalent. The same res in csc format would be more compact, with a 2-element indptr.
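As a quick check of that (a sketch, using the res column matrix from above), converting to CSC gives an indptr with one entry per column plus one, i.e. just 2 entries here:

csc = res.tocsc()
print(csc.indptr)    # array([0, 2], dtype=int32) - one entry per column (+1)
print(csc.indices)   # array([1, 4], dtype=int32) - row indices of the nonzeros
print(csc.data)      # array([1., 6.])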

It is also possible to work directly with the indptr, indices, and data attributes of the csr matrix, essentially iterating over the rows, summing each, and creating a new sparse matrix from the result. In some cases we've achieved speed improvements that way compared to the sparse methods. But you have to understand how that data is stored, and be smart about the whole thing.
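As a minimal sketch of that idea (row_sums_sparse is a hypothetical helper, not part of scipy), the row sums can be read off data and indptr and packed straight into a sparse column:

import numpy as np
from scipy.sparse import csr_matrix

def row_sums_sparse(M):
    # hypothetical helper: sum each row of a csr_matrix using only its raw attributes.
    # The cumulative sum of data, sampled at the row boundaries in indptr,
    # gives each row's sum as the difference between consecutive boundaries.
    csum = np.concatenate(([0.0], np.cumsum(M.data)))
    sums = csum[M.indptr[1:]] - csum[M.indptr[:-1]]
    rows = np.nonzero(sums)[0]          # keep only rows with a nonzero sum
    cols = np.zeros_like(rows)
    return csr_matrix((sums[rows], (rows, cols)), shape=(M.shape[0], 1))

# With the 5x5 matrix from the question this gives a 5x1 csr_matrix
# with 2 stored elements (1. in row 1, 6. in row 4).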



Source: https://stackoverflow.com/questions/62217382/scipy-sparse-matrix-sum-results-in-a-dense-matrix-how-to-enforce-result-sparse
