I'd like to write a function that normalizes the rows of a large sparse matrix (so that they sum to one).
from pylab import *
import scipy.sparse as sp
Without importing sklearn, converting to dense, or multiplying matrices, you can exploit the data representation of CSR matrices directly:
from scipy.sparse import isspmatrix_csr

def normalize(W):
    """Row-normalize a scipy sparse CSR matrix in place."""
    if not isspmatrix_csr(W):
        raise ValueError('W must be in CSR format.')
    for i in range(W.shape[0]):
        # W.data[W.indptr[i]:W.indptr[i+1]] holds the nonzero values of row i
        row_sum = W.data[W.indptr[i]:W.indptr[i+1]].sum()
        if row_sum != 0:
            W.data[W.indptr[i]:W.indptr[i+1]] /= row_sum
Remember that W.indices is the array of column indices, W.data is the array of corresponding nonzero values, and W.indptr points to the row starts within indices and data.
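As a quick illustration of that layout (a toy matrix, just to show what indptr, indices, and data contain):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 3x3 matrix; note that row 1 is entirely zero.
W = csr_matrix(np.array([[1.0, 2.0, 0.0],
                         [0.0, 0.0, 0.0],
                         [3.0, 0.0, 1.0]]))

print(W.indptr)   # [0 2 2 4]: row 0 spans data[0:2], row 1 is empty, row 2 spans data[2:4]
print(W.indices)  # [0 1 0 2]: column index of each stored value
print(W.data)     # [1. 2. 3. 1.]: the stored nonzero values
```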
You can apply numpy.abs() when taking the sum if you need the L1 norm, or use numpy.max() to normalize by the maximum value per row.
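As a self-contained sanity check of the approach above (the function is repeated here so the snippet runs on its own; the input uses a float dtype, which the in-place division requires):

```python
import numpy as np
from scipy.sparse import csr_matrix, isspmatrix_csr

def normalize(W):
    """Row-normalize a scipy sparse CSR matrix in place."""
    if not isspmatrix_csr(W):
        raise ValueError('W must be in CSR format.')
    for i in range(W.shape[0]):
        row_sum = W.data[W.indptr[i]:W.indptr[i+1]].sum()
        if row_sum != 0:
            W.data[W.indptr[i]:W.indptr[i+1]] /= row_sum

# Float dtype matters: in-place '/=' would fail on an integer data array.
W = csr_matrix(np.array([[1.0, 3.0, 0.0],
                         [0.0, 0.0, 0.0],
                         [2.0, 0.0, 2.0]]))
normalize(W)
print(W.toarray())
# Nonzero rows now sum to 1; the all-zero row is left untouched.
```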