Fast slicing and multiplication of scipy sparse CSR matrix

北慕城南 提交于 2020-01-24 01:46:48

问题


I have a scipy sparse CSR matrix of size 2M x 50k with 200M non-zero values (100 per row). I need to slice 120k rows of it by a (randomly distributed) index (which is a pandas Series) and then multiply that submatrix by a sparse vector of size 1x50k (with 100 non-zero values as well).

I do this:

slice = matrix[index.tolist(), :]
result = slice.dot(vector.T).T.toarray()[0]  # returns 1x120k array

The slicing takes 0.7s (slow) and then multiplication takes 0.05s.

Instead, I can multiply the whole matrix first and then slice the result:

result = matrix.dot(vector.T).T.toarray()[0]
result_sliced = result[index.tolist()]  # returns 1x120k array

In this case, multiplication takes 0.65s and then slicing takes 0.015s.

Questions:

  1. Why is slicing of a CSR matrix by rows so slow? Even multiplication of the whole matrix takes less time than it.

  2. Is there a way to achieve the end result faster?


回答1:


I explained in Sparse matrix slicing using list of int that this kind of row indexing is actually performed with matrix multiplication. In effect it constructs a sparse vector with 1's for the desired rows, and does the appropriate dot.

So I'm not surprised that the order of the operations doesn't matter much.

In general sparse matrices are not designed for efficient indexing. They don't, for example, return views. The csr matrix multiplication is one its most efficient operations. Even row or columns sums are performed with matrix multiplication.




回答2:


I ran into the same issue and my solution was to write a row extractor that relies on indexing of numpy arrays rather than sparse matrix multiplications. See my approach here.




回答3:


If you have enough RAM memory you can convert tolil() and slicing should be faster.



来源:https://stackoverflow.com/questions/42127046/fast-slicing-and-multiplication-of-scipy-sparse-csr-matrix

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!