What is the fastest way to slice a scipy.sparse matrix?

后端 未结 2 383
情歌与酒
情歌与酒 2020-12-25 15:11

I normally use

matrix[:, i:]

It seems not work as fast as I expected.

2条回答
  •  醉话见心
    2020-12-25 15:28

    I've found that the advertised fast row indexing of scipy.sparse.csr_matrix can be made a lot quicker by rolling your own row indexer. Here's the idea:

    class SparseRowIndexer:
        def __init__(self, csr_matrix):
            data = []
            indices = []
            indptr = []
    
            # Iterating over the rows this way is significantly more efficient
            # than csr_matrix[row_index,:] and csr_matrix.getrow(row_index)
            for row_start, row_end in zip(csr_matrix.indptr[:-1], csr_matrix.indptr[1:]):
                 data.append(csr_matrix.data[row_start:row_end])
                 indices.append(csr_matrix.indices[row_start:row_end])
                 indptr.append(row_end-row_start) # nnz of the row
    
            self.data = np.array(data)
            self.indices = np.array(indices)
            self.indptr = np.array(indptr)
            self.n_columns = csr_matrix.shape[1]
    
        def __getitem__(self, row_selector):
            data = np.concatenate(self.data[row_selector])
            indices = np.concatenate(self.indices[row_selector])
            indptr = np.append(0, np.cumsum(self.indptr[row_selector]))
    
            shape = [indptr.shape[0]-1, self.n_columns]
    
            return sparse.csr_matrix((data, indices, indptr), shape=shape)
    

    That is, it is possible to utilize the fast indexing of numpy arrays by storing the non-zero values of each row in separate arrays (with a different length for each row) and putting all of those row arrays in an object-typed array (allowing each row to have a different size) that can be indexed efficiently. The column indices are stored the same way. The approach is slightly different to the standard CSR data structure which stores all non-zero values in a single array, requiring look-ups to see where each row starts and ends. These look-ups can slow down random access but should be efficient for retrieval of contiguous rows.

    Profiling results

    My matrix mat is a 1,900,000x1,250,000 csr_matrix with 400,000,000 non-zero elements. ilocs is an array of 200,000 random row indices.

    >>> %timeit mat[ilocs]
    2.66 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    compared to:

    >>> row_indexer = SparseRowIndexer(mat)
    >>> %timeit row_indexer[ilocs]
    59.9 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    The SparseRowIndexer seems to be faster when using fancy indexing compared to boolean masks.

提交回复
热议问题