Efficient way to normalize a Scipy Sparse Matrix

孤独总比滥情好
孤独总比滥情好 2020-12-29 20:49

I'd like to write a function that normalizes the rows of a large sparse matrix (such that they sum to one).

from pylab import *
import scipy.sparse as sp

d         


        
5 Answers
  •  萌比男神i
    2020-12-29 21:31

    While Aaron's answer is correct, I implemented my own solution because I wanted to normalize with respect to the maximum of the absolute values, which sklearn does not offer. My method takes the row indices of the nonzero entries and uses them to locate each row's values in the csr_matrix.data array, so the values can be replaced there quickly.

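    import numpy as np

    def normalize_sparse(csr_matrix):
        """Scale each row in place by its largest absolute value."""
        # Row index of every nonzero entry; this lines up with csr_matrix.data
        # only if the matrix stores no explicit zeros.
        nonzero_rows = csr_matrix.nonzero()[0]
        for idx in np.unique(nonzero_rows):
            # Positions in csr_matrix.data that belong to row idx.
            data_idx = np.where(nonzero_rows == idx)[0]
            abs_max = np.max(np.abs(csr_matrix.data[data_idx]))
            if abs_max != 0:
                csr_matrix.data[data_idx] = 1. / abs_max * csr_matrix.data[data_idx]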
    

    In contrast to sunan's solution, this method neither converts the matrix to a dense format (which could cause memory problems) nor performs any matrix multiplication. I tested it on a sparse matrix of shape 35,000 × 486,000, and it took about 18 seconds.
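
The same idea can probably be written without the per-row Python loop. The sketch below is one possible variant, not the method from the answer above: it reads the row of every stored entry off the CSR row pointer (`indptr`) and reduces with `np.maximum.at`. The function name `normalize_rows_max_abs` is made up for illustration, and the function returns a float copy instead of modifying its input.

```python
import numpy as np
import scipy.sparse as sp


def normalize_rows_max_abs(mat):
    """Return a CSR copy of `mat` with each row divided by its largest absolute value.

    Hypothetical helper for illustration; not part of SciPy or scikit-learn.
    """
    mat = sp.csr_matrix(mat, dtype=np.float64, copy=True)
    # Number of stored entries in each row, read off the CSR row pointer.
    counts = np.diff(mat.indptr)
    # Row index of every stored entry, in the same order as mat.data.
    row_of_entry = np.repeat(np.arange(mat.shape[0]), counts)
    # Largest absolute value per row; rows without stored entries stay at 0.
    abs_max = np.zeros(mat.shape[0])
    np.maximum.at(abs_max, row_of_entry, np.abs(mat.data))
    # Leave empty or all-zero rows unchanged instead of dividing by zero.
    abs_max[abs_max == 0] = 1.0
    mat.data /= abs_max[row_of_entry]
    return mat


A = sp.csr_matrix(np.array([[1.0, -4.0, 0.0],
                            [0.0, 0.0, 0.0],
                            [2.0, 0.0, 0.5]]))
B = normalize_rows_max_abs(A)
print(B.toarray())
```

Because `row_of_entry` comes from `indptr` and not from `nonzero()`, explicitly stored zeros do not break the alignment with `mat.data`. Accumulating `mat.data` with `np.add.at` in place of the `np.maximum.at` step would give row sums, which is what the original question asks for.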
