expanding (adding a row or column) a scipy.sparse matrix

六眼飞鱼酱① submitted on 2019-11-28 18:14:47

I don't think there is any way to really escape the copying. Both of those sparse matrix types store their data internally as NumPy arrays (the data and indices attributes for csr, the data and rows attributes for lil), and NumPy arrays can't be extended in place.
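
To make the storage point concrete, here is a small sketch (the example matrix is made up for illustration) that inspects the arrays behind a csr and a lil matrix and shows that "growing" a NumPy array really produces a new one:

```python
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix

dense = np.array([[1, 0, 2],
                  [0, 0, 3]])

csr = csr_matrix(dense)
lil = lil_matrix(dense)

# Both formats keep their contents in NumPy arrays.
print(type(csr.data), type(csr.indices))
print(type(lil.data), type(lil.rows))

# NumPy arrays have a fixed size: np.append allocates a new array
# and copies the old contents into it.
grown = np.append(csr.data, 4)
print(grown is csr.data)                  # False
print(np.shares_memory(grown, csr.data))  # False
```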

Update with more information:

LIL is often read as "linked list" (the SciPy documentation expands it as "list of lists"), and either way the current implementation is not a true linked list. The NumPy arrays used for data and rows are both of type object. Each object in these arrays is actually a Python list (an empty list when all values in a row are zero). Python lists aren't exactly linked lists, but they are close and, quite frankly, a better choice because of O(1) look-up. Personally, I don't immediately see the point of using a NumPy array of objects here rather than just a Python list. You could fairly easily change the current lil implementation to use Python lists instead, which would allow you to add a row without copying the whole matrix.
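
A quick way to check this on your own install is to look at the lil attributes directly. This is only a sketch with a made-up matrix, and it relies on the internal layout described above, which could differ between SciPy versions:

```python
import numpy as np
from scipy.sparse import lil_matrix

m = lil_matrix(np.array([[1, 0, 2],
                         [0, 0, 0],
                         [0, 3, 0]]))

# rows and data are NumPy arrays of dtype object, one entry per matrix row.
print(m.rows.dtype, m.data.dtype)

# Each entry is a plain Python list: column indices in rows, values in data.
for cols, vals in zip(m.rows, m.data):
    print(type(cols).__name__, cols, vals)
```

The all-zero middle row shows up as a pair of empty lists.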

Scipy doesn't have a way to do this without copying the data but you can do it yourself by changing the attributes that define the sparse matrix.

There are 4 attributes that make up the csr_matrix:

data: An array containing the actual values in the matrix

indices: An array containing the column index corresponding to each value in data

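indptr: An array of row pointers. indptr[i] is the index in data (and indices) of the first value of row i, and indptr[i+1] is one past its last value. If a row is empty, its entry is the same as the next one, so indptr simply repeats the previous row's end index.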

shape: A tuple containing the shape of the matrix
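
As an illustration of how the four attributes fit together, here is a sketch on a small made-up matrix with one empty row:

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[5, 0, 6],
                         [0, 0, 0],
                         [0, 7, 0]]))

print(m.data)     # [5 6 7]
print(m.indices)  # [0 2 1]
print(m.indptr)   # [0 2 2 3]
print(m.shape)    # (3, 3)

# Row i occupies the slice indptr[i]:indptr[i+1] of data and indices.
for i in range(m.shape[0]):
    start, end = m.indptr[i], m.indptr[i + 1]
    print(i, m.indices[start:end], m.data[start:end])
```

The repeated 2 in indptr is the empty middle row: its slice starts and ends at the same position.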

If you are simply adding a row of zeros to the bottom all you have to do is change the shape and indptr for your matrix.

import numpy as np
from scipy.sparse import csr_matrix

x = np.ones((3, 5))
x = csr_matrix(x)
x.toarray()
# array([[1., 1., 1., 1., 1.],
#        [1., 1., 1., 1., 1.],
#        [1., 1., 1., 1., 1.]])

# reshape cannot change the number of elements, but you can cheat and set the shape yourself.
x._shape = (4, 5)
# Update indptr to let it know we added a row with nothing in it, so just append the last
# value in indptr to the end.
# Note that you are still copying the indptr array.
x.indptr = np.hstack((x.indptr, x.indptr[-1]))
x.toarray()
# array([[1., 1., 1., 1., 1.],
#        [1., 1., 1., 1., 1.],
#        [1., 1., 1., 1., 1.],
#        [0., 0., 0., 0., 0.]])
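
Depending on your SciPy version, there may be a supported alternative to editing the private _shape attribute: recent releases expose an in-place resize method on sparse matrices. A minimal sketch, assuming a SciPy version where csr_matrix.resize is available:

```python
import numpy as np
from scipy.sparse import csr_matrix

x = csr_matrix(np.ones((3, 5)))

# Grow the matrix in place by one row; the new row holds no stored values.
x.resize((4, 5))

print(x.shape)
print(x.toarray())
```

This keeps the existing values and leaves the added row empty, which is the same outcome as the manual indptr trick.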

Here is a function to handle the more general case of vstacking any 2 csr_matrices. You still end up copying the underlying numpy arrays but it is still significantly faster than the scipy vstack method.

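import numpy as np

def csr_vappend(a, b):
    """Append csr_matrix b to the bottom of csr_matrix a, in place.

    Much faster than scipy.sparse.vstack, but assumes both inputs are csr and
    overwrites the first matrix instead of copying it. The data, indices, and
    indptr arrays still get copied.
    """
    if a.shape[1] != b.shape[1]:
        raise ValueError("a and b must have the same number of columns")
    offset = a.indptr[-1]  # number of stored values in a, read before a is modified
    a.data = np.hstack((a.data, b.data))
    a.indices = np.hstack((a.indices, b.indices))
    # Shift b's row pointers past a's values and drop b's leading 0.
    a.indptr = np.hstack((a.indptr, (b.indptr + offset)[1:]))
    a._shape = (a.shape[0] + b.shape[0], a.shape[1])
    return a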
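
If you want to convince yourself that this gives the same result as SciPy's own vstack, a quick check along these lines should do. The two input matrices are made up, and the function is repeated in compact form only so the snippet runs on its own:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

def csr_vappend(a, b):
    # Same idea as the function above, repeated so this snippet is self-contained.
    offset = a.indptr[-1]
    a.data = np.hstack((a.data, b.data))
    a.indices = np.hstack((a.indices, b.indices))
    a.indptr = np.hstack((a.indptr, (b.indptr + offset)[1:]))
    a._shape = (a.shape[0] + b.shape[0], b.shape[1])
    return a

top = csr_matrix(np.array([[1, 0, 2],
                           [0, 3, 0]]))
bottom = csr_matrix(np.array([[0, 0, 0],
                              [4, 0, 5]]))

# Compute the reference result first, because csr_vappend modifies top.
expected = vstack([top, bottom]).toarray()
result = csr_vappend(top, bottom)

print(np.array_equal(result.toarray(), expected))  # True
print(result is top)                               # True
```

Note that the first argument is changed in place, so keep a copy if you still need the original.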

Not sure if you're still looking for a solution, but others may want to look into hstack and vstack: http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html. I think you can define a csr_matrix for the single additional row and then vstack it with the existing matrix.
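
A minimal sketch of that approach, with made-up values, adding one row with vstack and then one column with hstack:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, vstack

m = csr_matrix(np.array([[1, 0],
                         [0, 2]]))

# One extra row: a 1 x n_cols sparse matrix stacked underneath.
new_row = csr_matrix(np.array([[3, 4]]))
taller = vstack([m, new_row], format="csr")

# One extra column: an n_rows x 1 sparse matrix stacked on the right.
new_col = csr_matrix(np.array([[5], [6], [7]]))
wider = hstack([taller, new_col], format="csr")

print(taller.toarray())
print(wider.toarray())
```

Both calls return new matrices and leave m untouched, so this is the copying route, but it uses only the public API.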
