What is the fastest way to slice a scipy.sparse matrix?

Asked by 情歌与酒 on 2020-12-25 15:11

I normally use:

matrix[:, i:]

It doesn't seem to be as fast as I expected.

2 Answers
  • 2020-12-25 15:28

    I've found that the advertised fast row indexing of scipy.sparse.csr_matrix can be made a lot quicker by rolling your own row indexer. Here's the idea:

    import numpy as np
    from scipy import sparse

    class SparseRowIndexer:
        def __init__(self, csr_matrix):
            data = []
            indices = []
            indptr = []

            # Iterating over the rows this way is significantly more efficient
            # than csr_matrix[row_index, :] or csr_matrix.getrow(row_index)
            for row_start, row_end in zip(csr_matrix.indptr[:-1], csr_matrix.indptr[1:]):
                data.append(csr_matrix.data[row_start:row_end])
                indices.append(csr_matrix.indices[row_start:row_end])
                indptr.append(row_end - row_start)  # nnz of the row

            # dtype=object lets each element hold a row array of a different
            # length while still supporting fast fancy indexing
            self.data = np.array(data, dtype=object)
            self.indices = np.array(indices, dtype=object)
            self.indptr = np.array(indptr)
            self.n_columns = csr_matrix.shape[1]

        def __getitem__(self, row_selector):
            # Gather the selected rows and stitch them back into a CSR matrix
            data = np.concatenate(self.data[row_selector])
            indices = np.concatenate(self.indices[row_selector])
            indptr = np.append(0, np.cumsum(self.indptr[row_selector]))

            shape = [indptr.shape[0] - 1, self.n_columns]

            return sparse.csr_matrix((data, indices, indptr), shape=shape)
    

    That is, it is possible to exploit the fast indexing of numpy arrays by storing the non-zero values of each row in a separate array (with a different length for each row) and putting all of those row arrays in an object-dtype array (allowing each row to have a different size) that can be indexed efficiently. The column indices are stored the same way. This differs slightly from the standard CSR data structure, which stores all non-zero values in a single array and requires look-ups to see where each row starts and ends. Those look-ups can slow down random access but should be efficient for retrieving contiguous rows.
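
    For example, a minimal usage sketch (the matrix size, density, and index array below are made-up placeholders, not the matrix from the benchmark):

    import numpy as np
    from scipy import sparse

    # A small random CSR matrix to index into (sizes are arbitrary)
    mat = sparse.random(10000, 5000, density=0.01, format='csr', random_state=0)

    row_indexer = SparseRowIndexer(mat)

    # Fancy indexing with an array of row positions returns a new csr_matrix
    ilocs = np.random.default_rng(0).integers(0, mat.shape[0], size=1000)
    rows = row_indexer[ilocs]
    assert rows.shape == (1000, mat.shape[1])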

    Profiling results

    My matrix mat is a 1,900,000x1,250,000 csr_matrix with 400,000,000 non-zero elements. ilocs is an array of 200,000 random row indices.

    >>> %timeit mat[ilocs]
    2.66 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    compared to:

    >>> row_indexer = SparseRowIndexer(mat)
    >>> %timeit row_indexer[ilocs]
    59.9 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    The SparseRowIndexer seems to be faster when using fancy indexing than when using boolean masks, as sketched below.
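
    If you start from a boolean mask, one option (my suggestion, not something benchmarked above) is to convert it to integer positions first so you stay on the fancy-indexing path:

    # Convert a boolean row mask to integer indices before indexing
    mask = np.zeros(mat.shape[0], dtype=bool)
    mask[ilocs] = True
    rows = row_indexer[np.flatnonzero(mask)]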

  • 2020-12-25 15:34

    If you want a sparse matrix as output, the fastest way to do row slicing is to use the csr format, and for column slicing the csc format, as detailed here. In both cases you just have to do what you are currently doing:

    matrix[l1:l2,c1:c2]
    
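    If the matrix is not already stored in the right format, it can pay to convert once before repeated slicing. A minimal sketch (the matrix below is a made-up example):

    from scipy import sparse

    matrix = sparse.random(1000, 1000, density=0.01, format='coo')

    csr = matrix.tocsr()  # fast row slicing
    csc = matrix.tocsc()  # fast column slicing

    rows = csr[10:20, :]   # returns a csr_matrix
    cols = csc[:, 10:20]   # returns a csc_matrix
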

    If you want another type as output, there may be faster ways. This other answer explains many methods for slicing a matrix and compares their timings. For example, if you want an ndarray as output, the fastest slicing is:

    matrix.A[l1:l2,c1:c2] 
    

    or:

    matrix.toarray()[l1:l2,c1:c2]
    

    both of which are much faster than:

    matrix[l1:l2,c1:c2].A #or .toarray()
    
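    To check which variant wins on your own data, a quick timing sketch (the matrix, slice bounds, and repeat count are placeholders; the trade-off depends on size and density, since .toarray() densifies the whole matrix first):

    import timeit
    from scipy import sparse

    matrix = sparse.random(1000, 1000, density=0.01, format='csr')

    dense_first = timeit.timeit(lambda: matrix.toarray()[10:20, 10:20], number=100)
    slice_first = timeit.timeit(lambda: matrix[10:20, 10:20].toarray(), number=100)
    print(f"dense-first: {dense_first:.4f}s, slice-first: {slice_first:.4f}s")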