Efficient way to iterate through coo_matrix elements ordered by column?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-11 17:58:15

Question


I have a scipy.sparse.coo_matrix that I want to convert to one bitset per column for further calculation. (For the purposes of this example, I'm testing on a 100K x 1M matrix.)

I'm currently doing something like this:

from intbitset import intbitset

# One bitset per column; the built-in zip replaces Python 2's itertools.izip
bitsets = [intbitset() for _ in range(matrix.shape[1])]
for i, j in zip(matrix.row, matrix.col):
    bitsets[j].add(i)

That works, but the entries of my COO matrix come out ordered by row. Ideally, I'd like to iterate by column and build each bitset in one go, instead of adding to a different bitset on every step.

I couldn't find a way to iterate over the matrix by column. Is there one?

I don't mind converting to another sparse format, but I couldn't find a way to iterate efficiently there either. (Using nonzero() on a CSC matrix turned out to be extremely inefficient.)

Thanks!


Answer 1:


Make a small sparse matrix:

In [82]: M = sparse.random(5,5,.2, 'coo')*2
In [83]: M
Out[83]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in COOrdinate format>
In [84]: print(M)
  (1, 3)    0.03079661961875302
  (0, 2)    0.722023291734881
  (0, 3)    0.547594065264775
  (1, 0)    1.1021150713641839
  (1, 2)    0.585848976928308

That print, as well as nonzero(), shows the row and col arrays:

In [85]: M.nonzero()
Out[85]: (array([1, 0, 0, 1, 1], dtype=int32), array([3, 2, 3, 0, 2], dtype=int32))

Conversion to csr orders the rows (but not necessarily the column indices within each row). nonzero converts back to coo and returns the row and col arrays in the new order.

In [86]: M.tocsr().nonzero()
Out[86]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))

I was going to say that conversion to csc orders the columns, but nonzero doesn't show that; its result is still in row order:

In [87]: M.tocsc().nonzero()
Out[87]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))

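Transpose of csr produces a csc. Because the matrix is transposed, the first array below holds M's column indices, and they come out sorted: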

In [88]: M.tocsr().T.nonzero()
Out[88]: (array([0, 2, 2, 3, 3], dtype=int32), array([1, 0, 1, 0, 1], dtype=int32))
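If the goal is one group of row indices per column, a csc matrix already stores its entries that way in its indices and indptr attributes, so nonzero() can be skipped. The following is only a sketch under assumptions: the small matrix is hand-built for illustration, and rows_per_column is a hypothetical helper name, not something from scipy.

```python
import numpy as np
from scipy import sparse

# Small hand-built COO matrix (same pattern as M above, values are illustrative).
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
M = sparse.coo_matrix((np.ones(5), (row, col)), shape=(5, 5))

def rows_per_column(m):
    """Hypothetical helper: return one array of row indices per column."""
    csc = m.tocsc()
    csc.sort_indices()  # make row indices ascending within each column
    # indptr[j]:indptr[j + 1] delimits column j inside csc.indices
    return [csc.indices[csc.indptr[j]:csc.indptr[j + 1]]
            for j in range(csc.shape[1])]

cols = rows_per_column(M)
```

Each element of cols could then be passed to a bitset constructor in a single call, assuming the bitset type accepts an iterable of integers.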

I don't fully follow what you are trying to do, or why you want a column sort, but the lil format might help:

In [90]: M.tolil().rows
Out[90]: 
array([list([2, 3]), list([0, 2, 3]), list([]), list([]), list([])],
      dtype=object)
In [91]: M.tolil().T.rows
Out[91]: 
array([list([1]), list([]), list([0, 1]), list([0, 1]), list([])],
      dtype=object)
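As a rough sketch of how the transposed lil rows could feed per-column sets: the example below transposes before converting to lil (rather than after, as in the session above) and uses Python's built-in set as a stand-in for intbitset, which is an assumption made so the example runs without extra packages.

```python
import numpy as np
from scipy import sparse

# Hand-built matrix with the same sparsity pattern as M above.
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
M = sparse.coo_matrix((np.ones(5), (row, col)), shape=(5, 5))

# Rows of the transposed matrix are the columns of the original,
# so .rows holds one list of original row indices per column.
column_rows = M.T.tolil().rows

# Built-in set stands in for intbitset here (an assumption for illustration).
column_sets = [set(r) for r in column_rows]
```

Each set is built in one call from a complete list, which is the "build the bitset at once" pattern the question asks for. Whether this beats the original loop on a 100K x 1M matrix would need to be timed.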

In general, iterating over sparse matrices is slow. Matrix multiplication in the csr and csc formats is the fastest operation, and many other operations use it indirectly (e.g. row sum). Another relatively fast set of operations is those that work directly on the data attribute, without regard to row or column values.
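To make the two fast paths concrete, here is a minimal sketch with made-up values: column sums computed as a matrix-vector product, and a scaling applied straight to the data attribute without touching any indices.

```python
import numpy as np
from scipy import sparse

row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
data = np.array([0.5, 1.0, 2.0, 3.0, 4.0])  # illustrative values
M = sparse.coo_matrix((data, (row, col)), shape=(5, 5)).tocsr()

# Column sums as a product: M.T times a vector of ones.
col_sums = M.T.dot(np.ones(M.shape[0]))

# Scale every stored value via the data attribute; indices are untouched.
doubled = M.copy()
doubled.data *= 2
```

Neither step loops over elements in Python, which is why operations of this kind tend to stay fast on large matrices.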

coo doesn't implement indexing or iteration; csr and lil do.
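Since coo does expose its row and col arrays directly, one option that needs no indexing at all is to sort those arrays by column with numpy and split them into groups. This is a sketch of that idea, not something taken from the answer above; the matrix is hand-built for illustration.

```python
import numpy as np
from scipy import sparse

row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
M = sparse.coo_matrix((np.ones(5), (row, col)), shape=(5, 5))

# lexsort uses the LAST key as the primary one: sort by col, then by row.
order = np.lexsort((M.row, M.col))
sorted_rows = M.row[order]

# Number of entries in each column, including empty columns.
counts = np.bincount(M.col, minlength=M.shape[1])

# Split the sorted row indices at the column boundaries.
groups = np.split(sorted_rows, np.cumsum(counts)[:-1])
```

groups then holds one array of row indices per column, in column order, with empty arrays for columns that have no entries.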



Source: https://stackoverflow.com/questions/48936510/efficient-way-to-iterate-through-coo-matrix-elements-ordered-by-column
