问题
I have a big csr_matrix(1M*1K) and I want to add over rows and obtain a new csr_matrix with the same number of columns but reduced number of rows. Actually my problem is exactly same as this Sum over rows in scipy.sparse.csr_matrix. The only thing is I find the accepted solution to be slow for my purpose. Let me state what I have
map_fn = np.random.randint(0, 10000, 1000000)
map_fn
here tells me how my input rows(1M) are mapped into my output rows(10K). For example ith input row gets added up into map_fn[i]
output row. I tried the two approaches mentioned in the above question,
namely forming a sparse matrix and using sparse sum. Although the sparse matrix approach looks way better than sparse sum approach but I find it slow for my purpose. Here is the code comparing two approaches:
import scipy.sparse
import numpy as np
import time
print "Setting up input"
s=10000
n=1000000
d=1000
density=1.0/500
X=scipy.sparse.rand(n,d,density=density,format="csr")
map_fn=np.random.randint(0, s, n)
# Approach 1
start_time=time.time()
col = scipy.arange(n)
val = np.ones(n)
S = scipy.sparse.csr_matrix( (val, (map_fn, col)), shape = (s,n))
print "Approach 1 Creation time : ",time.time()-start_time
SX = S.dot(X)
print "Approach 1 Total time : ",time.time()-start_time
#Approach 2
start_time=time.time()
SX = np.zeros((s,X.shape[1]))
for i in range(SX.shape[0]):
SX[i,:] = X[np.where(map_fn==i)[0],:].sum(axis=0)
print "Approach 2 Total time : ",time.time()-start_time
which gives following numbers:
Approach 1 Creation time : 0.187678098679
Approach 1 Total time : 0.286989927292
Approach 2 Total time : 10.208632946
So my question is this is there a better way of doing this? I find forming sparse matrix to be an overkill as it takes more than half of the time. Are there any better alternatives? Any suggestions are greatly appreciated. Thanks
回答1:
Starting approach
Adapting sparse solution from this post -
def sparse_matrix_mult_sparseX_mod1(X, rows):
nrows = rows.max()+1
ncols = X.shape[1]
nelem = nrows * ncols
a,b = X.nonzero()
ids = rows[a] + b*nrows
sums = np.bincount(ids, X[a,b].A1, minlength=nelem)
out = sums.reshape(ncols,-1).T
return out
Benchmarking
Original approach #1 -
def app1(X, map_fn):
col = scipy.arange(n)
val = np.ones(n)
S = scipy.sparse.csr_matrix( (val, (map_fn, col)), shape = (s,n))
SX = S.dot(X)
return SX
Timings and verification -
In [209]: # Inputs setup
...: s=10000
...: n=1000000
...: d=1000
...: density=1.0/500
...:
...: X=scipy.sparse.rand(n,d,density=density,format="csr")
...: map_fn=np.random.randint(0, s, n)
...:
In [210]: out1 = app1(X, map_fn)
...: out2 = sparse_matrix_mult_sparseX_mod1(X, map_fn)
...: print np.allclose(out1.toarray(), out2)
...:
True
In [211]: %timeit app1(X, map_fn)
1 loop, best of 3: 517 ms per loop
In [212]: %timeit sparse_matrix_mult_sparseX_mod1(X, map_fn)
10 loops, best of 3: 147 ms per loop
To be fair, we should time the final dense array version from app1
-
In [214]: %timeit app1(X, map_fn).toarray()
1 loop, best of 3: 584 ms per loop
Porting to Numba
We could translate the binned counting step to numba, which might be beneficial for denser input matrices. One of the ways to do so would be -
from numba import njit
@njit
def bincount_mod2(out, rows, r, C, V):
N = len(V)
for i in range(N):
out[rows[r[i]], C[i]] += V[i]
return out
def sparse_matrix_mult_sparseX_mod2(X, rows):
nrows = rows.max()+1
ncols = X.shape[1]
r,C = X.nonzero()
V = X[r,C].A1
out = np.zeros((nrows, ncols))
return bincount_mod2(out, rows, r, C, V)
Timings -
In [373]: # Inputs setup
...: s=10000
...: n=1000000
...: d=1000
...: density=1.0/100 # Denser now!
...:
...: X=scipy.sparse.rand(n,d,density=density,format="csr")
...: map_fn=np.random.randint(0, s, n)
...:
In [374]: %timeit app1(X, map_fn)
1 loop, best of 3: 787 ms per loop
In [375]: %timeit sparse_matrix_mult_sparseX_mod1(X, map_fn)
1 loop, best of 3: 906 ms per loop
In [376]: %timeit sparse_matrix_mult_sparseX_mod2(X, map_fn)
1 loop, best of 3: 705 ms per loop
With the dense output from app1
-
In [379]: %timeit app1(X, map_fn).toarray()
1 loop, best of 3: 910 ms per loop
来源:https://stackoverflow.com/questions/46475895/fastest-way-to-sum-over-rows-of-sparse-matrix