Vectorized groupby with NumPy

自闭症患者 2020-12-31 08:11

Pandas has a widely-used groupby facility to split up a DataFrame based on a corresponding mapping, from which you can apply a calculation on each subgroup and recombine the

4 Answers

再見小時候 (original poster)

2020-12-31 09:02

At first sight, @klim's sparse matrix solution appears to be tied to summation. We can, however, use it in the general case by converting between the CSR and CSC formats.

Let's look at a small example:

>>> import numpy as np
>>> from scipy import sparse
>>>
>>> m, n = 3, 8
>>> idx = np.random.randint(0, m, (n,))
>>> data = np.arange(n)
>>>
>>> # one stored entry per row: row i has its entry in column idx[i]
>>> M = sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m))
>>>
>>> idx
array([0, 2, 2, 1, 1, 2, 2, 0])
>>>
>>> M = M.tocsc()
>>>
>>> M.indptr, M.indices
(array([0, 2, 4, 8], dtype=int32), array([0, 7, 3, 4, 1, 2, 5, 6], dtype=int32))

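As we can see, after the conversion the internal representation of the sparse matrix holds the original positions grouped by label and sorted within each group: M.indices lists the positions, and M.indptr marks where each group starts and ends.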

>>> groups = np.split(M.indices, M.indptr[1:-1])
>>> groups
[array([0, 7], dtype=int32), array([3, 4], dtype=int32), array([1, 2, 5, 6], dtype=int32)]
>>>
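
To make the "general case" concrete, here is a minimal sketch that wraps the conversion in a helper and applies a non-additive reduction (a median) to each group. The helper name group_indices and the values array are made up for illustration; they are not part of @klim's answer or of SciPy. It assumes the labels are integers in the range 0..m-1.

```python
import numpy as np
from scipy import sparse

def group_indices(idx, m):
    """Hypothetical helper: one array of positions per label 0..m-1."""
    idx = np.asarray(idx)
    n = idx.size
    # One stored entry per row: row i has its single entry in column idx[i].
    # The stored values are irrelevant here, only the sparsity pattern matters.
    M = sparse.csr_matrix((np.ones(n), idx, np.arange(n + 1)), (n, m))
    M = M.tocsc()
    # In CSC form, M.indices holds the row numbers column by column and
    # M.indptr holds the column boundaries, so splitting yields the groups.
    return np.split(M.indices, M.indptr[1:-1])

idx = np.array([0, 2, 2, 1, 1, 2, 2, 0])
values = np.array([5, 1, 4, 2, 8, 3, 9, 7])   # illustrative data

groups = group_indices(idx, 3)
medians = [float(np.median(values[g])) for g in groups]
print(medians)   # [6.0, 5.0, 3.5]
```

The per-group loop is plain Python, so this only pays off when the grouping itself is the expensive part. A label that never occurs produces an empty array, which a reduction such as np.median does not handle gracefully.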

We could have obtained the same ordering with a stable argsort:

>>> np.argsort(idx, kind='mergesort')
array([0, 7, 3, 4, 1, 2, 5, 6])
>>>
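
The argsort result is only the flat ordering. To get the same list of groups as above, the group boundaries are needed as well. A minimal sketch, assuming integer labels in 0..m-1 and using np.bincount for the group sizes (the function name group_indices_argsort is hypothetical):

```python
import numpy as np

def group_indices_argsort(idx, m):
    """Hypothetical helper: same grouping as the CSR/CSC trick, NumPy only."""
    idx = np.asarray(idx)
    # A stable sort keeps positions with equal labels in their original order.
    order = np.argsort(idx, kind="stable")
    # Group sizes; minlength makes sure labels that never occur are counted as 0.
    counts = np.bincount(idx, minlength=m)
    # Cumulative sizes give the split points between consecutive groups.
    boundaries = np.cumsum(counts)[:-1]
    return np.split(order, boundaries)

idx = np.array([0, 2, 2, 1, 1, 2, 2, 0])
argsort_groups = group_indices_argsort(idx, 3)
print([g.tolist() for g in argsort_groups])   # [[0, 7], [3, 4], [1, 2, 5, 6]]
```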

But in the benchmark below the sparse matrix conversion is actually faster, even when we allow argsort to use a faster non-stable algorithm:

>>> m, n = 1000, 100000
>>> idx = np.random.randint(0, m, (n,))
>>> data = np.arange(n)
>>> 
>>> timeit('sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m)).tocsc()', **kwds)
2.250748165184632
>>> timeit('np.argsort(idx)', **kwds)
5.783584725111723

If we require argsort to keep the positions within each group sorted, which needs a stable sort, the difference is even larger:

>>> timeit('np.argsort(idx, kind="mergesort")', **kwds)
10.507467685034499
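
The kwds passed to timeit above are not shown in the answer. For anyone who wants to repeat the comparison, here is a self-contained sketch; the function name bench, the repetition count, and the seed are assumptions, so absolute numbers will not match the ones above and the ranking may vary with the NumPy and SciPy versions and the hardware.

```python
import timeit
import numpy as np
from scipy import sparse

def bench(m=1000, n=100_000, number=10, seed=0):
    """Hypothetical benchmark: total seconds for `number` runs of each approach."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, m, n)
    data = np.arange(n)
    indptr = np.arange(n + 1)
    candidates = {
        "csr -> csc": lambda: sparse.csr_matrix((data, idx, indptr), (n, m)).tocsc(),
        "argsort (default)": lambda: np.argsort(idx),
        "argsort (stable)": lambda: np.argsort(idx, kind="stable"),
    }
    # timeit.timeit accepts a zero-argument callable as the statement to time.
    return {name: timeit.timeit(fn, number=number) for name, fn in candidates.items()}

if __name__ == "__main__":
    for name, seconds in bench().items():
        print(f"{name:20s} {seconds:.4f} s")
```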
