Efficient incremental sparse matrix in python / scipy / numpy

Question

Is there a way in Python to update a sparse matrix incrementally and efficiently?

 from scipy.sparse import lil_matrix

 H = lil_matrix((n, m))
 for (i, j) in zip(A, B):
     H[i, j] += compute_something  # placeholder for the per-entry value

Building a sparse matrix this way seems quite slow, even though lil_matrix is supposed to be the fastest sparse matrix type for it.

Is there a way (such as a dict of dicts, or some other approach) to build the sparse matrix H efficiently?

Answer 1:

In https://stackoverflow.com/a/27771335/901925 I explore incremental matrix assignment.

lil and dok are the recommended formats if you want to change values. csr will give you an efficiency warning, and coo does not allow indexing.

But I also found that dok indexing is slow compared to regular dictionary indexing. So for many changes it is better to build a plain dictionary (with the same tuple indexing), and build the dok matrix from that.
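As a rough sketch of the plain-dictionary idea: accumulate into a dict keyed by (i, j) tuples, then convert once at the end. The helper name accumulate_to_dok is made up for illustration, and the conversion goes through coo_matrix and .todok() rather than filling the dok matrix directly, which is my choice and not necessarily what the linked answer does.

```python
from collections import defaultdict

import numpy as np
from scipy import sparse


def accumulate_to_dok(rows, cols, values, shape):
    """Hypothetical helper: sum values per (i, j) in a dict, then build a dok matrix."""
    acc = defaultdict(float)
    for i, j, v in zip(rows, cols, values):
        acc[(i, j)] += v              # plain dict update, no sparse indexing involved

    if not acc:                       # nothing to store: return an empty matrix
        return sparse.dok_matrix(shape, dtype=float)

    keys = np.array(list(acc.keys()), dtype=np.intp)   # shape (nnz, 2)
    data = np.fromiter(acc.values(), dtype=float, count=len(acc))
    coo = sparse.coo_matrix((data, (keys[:, 0], keys[:, 1])), shape=shape)
    return coo.todok()


H = accumulate_to_dok([0, 1, 1, 2], [0, 2, 2, 1], [0.0, 4.0, 5.0, 6.0], (4, 4))
```

The two contributions to (1, 2) are added together inside the dict, so the sparse matrix is only touched once, at construction.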

But if you can calculate the H data values with a fast numpy vector operation, as opposed to iteration, it is best to do so, and construct the sparse matrix from that (e.g. coo format). In fact even with iteration this would be faster:

 import numpy as np
 from scipy import sparse

 h = np.zeros(A.shape)
 for k, (i, j) in enumerate(zip(A, B)):
     h[k] = compute_something  # placeholder for the per-entry value
 H = sparse.coo_matrix((h, (A, B)), shape=(n, m))

e.g.

In [780]: A=np.array([0,1,1,2]); B=np.array([0,2,2,1])
In [781]: h=np.zeros(A.shape)
In [782]: for k, (i,j) in enumerate(zip(A,B)):
   .....:     h[k] = i+j+k
   .....:     
In [783]: h
Out[783]: array([ 0.,  4.,  5.,  6.])
In [784]: M=sparse.coo_matrix((h,(A,B)),shape=(4,4))
In [785]: M
Out[785]: 
<4x4 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in COOrdinate format>
In [786]: M.A
Out[786]: 
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  9.,  0.],
       [ 0.,  6.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

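Note that the (1,2) value is the sum 4+5. The coo matrix stores both entries for that coordinate, and they are summed on conversion (that's part of the coo to csr conversion, and the dense output above shows the same thing).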
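To make the duplicate handling visible, here is a small sketch using the same A, B and h values as the session above. It assumes a SciPy version where coo_matrix has sum_duplicates(); the variable names are mine.

```python
import numpy as np
from scipy import sparse

A = np.array([0, 1, 1, 2])
B = np.array([0, 2, 2, 1])
h = np.array([0.0, 4.0, 5.0, 6.0])

M = sparse.coo_matrix((h, (A, B)), shape=(4, 4))
stored_before = M.data.copy()   # coo still holds 4 and 5 as separate entries

C = M.tocsr()                   # the conversion adds the two (1, 2) entries
value_12 = C[1, 2]

M.sum_duplicates()              # merge duplicates in place, staying in coo format
dense = M.toarray()
```

So the summing is not something you need to do yourself before building the coo matrix; repeated (i, j) pairs in A and B act like +=.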

In this case I could have calculated h with:

In [791]: A+B+np.arange(A.shape[0])
Out[791]: array([0, 4, 5, 6])

so there's no need for iteration.

Answer 2:

Nope, do not use csr_matrix or csc_matrix; if you construct them incrementally they are going to be even slower than lil_matrix. The Dictionary Of Keys based sparse matrix is exactly what you are looking for:

import numpy as np
from scipy.sparse import dok_matrix

S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j    # Update elements
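The question accumulates with += rather than assigning, so here is a minimal sketch of the same dok approach applied to repeated (i, j) pairs. The index lists and values are made up for illustration, and the final tocsr() is just one reasonable format to hand to later arithmetic.

```python
import numpy as np
from scipy.sparse import dok_matrix

# Illustrative inputs: note that (1, 2) appears twice
A = [0, 1, 1, 2]
B = [0, 2, 2, 1]
values = [1.0, 4.0, 5.0, 6.0]

S = dok_matrix((4, 4), dtype=np.float32)
for i, j, v in zip(A, B, values):
    S[i, j] += v        # reads the current entry (0 if absent) and stores the sum

H = S.tocsr()           # convert once, after all updates are done
```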

Answer 3:

A faster way would be:

from scipy.sparse import coo_matrix

H_ij = compute_something_vectorized()
H = coo_matrix((H_ij, (A, B)), shape=(n, m)).tocsr()

The data for duplicate coordinates are then summed; see the docs for coo_matrix.
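For a quick sanity check that the one-shot coo construction matches the incremental loop from the question, something like the following could be used. The body of compute_something_vectorized is a hypothetical stand-in (the real computation isn't given), and it takes A and B as arguments here purely for the sake of a runnable example.

```python
import numpy as np
from scipy.sparse import coo_matrix, lil_matrix


def compute_something_vectorized(A, B):
    # Hypothetical stand-in for the real per-entry computation
    return (A + 1.0) * (B + 2.0)


n, m = 4, 4
A = np.array([0, 1, 1, 2])
B = np.array([0, 2, 2, 1])

# One-shot construction: duplicates at (1, 2) are summed by tocsr()
H_ij = compute_something_vectorized(A, B)
H = coo_matrix((H_ij, (A, B)), shape=(n, m)).tocsr()

# Reference: the slow incremental loop from the question
H_ref = lil_matrix((n, m))
for i, j, v in zip(A, B, H_ij):
    H_ref[i, j] += v

same = np.allclose(H.toarray(), H_ref.toarray())
```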

Source: https://stackoverflow.com/questions/35773101/efficient-incremental-sparse-matrix-in-python-scipy-numpy

Tags: python, numpy, matrix, scipy, sparse-matrix