Vectorized groupby with NumPy

自闭症患者 2020-12-31 08:11

Pandas has a widely-used groupby facility to split up a DataFrame based on a corresponding mapping, from which you can apply a calculation on each subgroup and recombine the

4 Answers
  •  再見小時候
    2020-12-31 09:02

    At first sight, @klim's sparse matrix solution appears to be tied to summation. However, it works in the general case if we convert between the CSR and CSC formats:

    Let's look at a small example:

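    >>> import numpy as np
    >>> from scipy import sparse
    >>>
    >>> m, n = 3, 8
    >>> idx = np.random.randint(0, m, (n,))
    >>> data = np.arange(n)
    >>>
    >>> # one row per element, with its single entry in column idx[i]
    >>> M = sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m))
    >>>
    >>> idx
    array([0, 2, 2, 1, 1, 2, 2, 0])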
    >>> 
    >>> M = M.tocsc()
    >>> 
    >>> M.indptr, M.indices
    (array([0, 2, 4, 8], dtype=int32), array([0, 7, 3, 4, 1, 2, 5, 6], dtype=int32))
    

    As we can see, after the conversion the internal representation of the sparse matrix holds the indices grouped and sorted:

    >>> groups = np.split(M.indices, M.indptr[1:-1])
    >>> groups
    [array([0, 7], dtype=int32), array([3, 4], dtype=int32), array([1, 2, 5, 6], dtype=int32)]
    >>> 
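
The grouping step above can be wrapped into a small reusable helper. This is only a sketch: the names `group_indices` and `group_apply` are hypothetical and not part of SciPy or the original answer, and the group labels are assumed to be integers in `range(n_groups)`. It applies an arbitrary reduction (here `np.median`) per group, which is the "general case" beyond summation.

```python
import numpy as np
from scipy import sparse


def group_indices(idx, n_groups):
    """Return one index array per group label in range(n_groups)."""
    idx = np.asarray(idx)
    n = idx.size
    # One row per element, with a single stored entry in column idx[i].
    # The stored values are never read, so an array of ones is enough.
    M = sparse.csr_matrix(
        (np.ones(n, dtype=np.int8), idx, np.arange(n + 1)), shape=(n, n_groups)
    )
    M = M.tocsc()
    # In CSC form, M.indices lists the row numbers column by column and
    # M.indptr marks where each column (that is, each group) begins.
    return np.split(M.indices, M.indptr[1:-1])


def group_apply(values, idx, n_groups, func):
    """Apply func to the values belonging to each group (hypothetical helper)."""
    values = np.asarray(values)
    return [func(values[g]) for g in group_indices(idx, n_groups)]


idx = np.array([0, 2, 2, 1, 1, 2, 2, 0])
values = np.array([5.0, 1.0, 4.0, 2.0, 8.0, 3.0, 9.0, 7.0])
medians = group_apply(values, idx, 3, np.median)
print(medians)
```

The final loop over groups is plain Python, so this only pays off when the number of groups is small relative to the number of elements.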
    

    We could have obtained the same grouping using a stable argsort:

    >>> np.argsort(idx, kind='mergesort')
    array([0, 7, 3, 4, 1, 2, 5, 6])
    >>> 
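
The argsort result is one flat array, so it still has to be cut at the group boundaries. One possible way to do that, shown here as a sketch and not taken from the original answer, is to count the group sizes with `np.bincount` and split at their cumulative sums. The number of groups `m` is assumed to be known.

```python
import numpy as np

idx = np.array([0, 2, 2, 1, 1, 2, 2, 0])
m = 3  # number of distinct group labels, assumed known in advance

# A stable sort keeps equal labels in their original relative order.
order = np.argsort(idx, kind="stable")

# bincount gives the size of each group; the cumulative sum gives the
# positions in `order` where the next group starts.
counts = np.bincount(idx, minlength=m)
boundaries = np.cumsum(counts)[:-1]

groups = np.split(order, boundaries)
print(groups)
```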
    

    But the sparse matrix route is actually faster in this benchmark, even when we allow argsort to use its faster, non-stable default algorithm:

    >>> m, n = 1000, 100000
    >>> idx = np.random.randint(0, m, (n,))
    >>> data = np.arange(n)
    >>> 
    >>> timeit('sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m)).tocsc()', **kwds)
    2.250748165184632
    >>> timeit('np.argsort(idx)', **kwds)
    5.783584725111723
    

    If we require argsort to keep the indices within each group sorted (a stable sort), the difference is even larger:

    >>> timeit('np.argsort(idx, kind="mergesort")', **kwds)
    10.507467685034499
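
The `kwds` passed to `timeit` are not shown in the answer. To rerun the comparison, something like the following should work. The `globals` and `number` settings here are assumptions, so the absolute numbers will not match the ones quoted, and the ranking may differ with hardware and with NumPy or SciPy versions.

```python
from timeit import timeit

import numpy as np
from scipy import sparse

m, n = 1000, 100000
idx = np.random.randint(0, m, (n,))
data = np.arange(n)

# Assumed timeit settings; the original values of `kwds` are unknown.
kwds = dict(globals=globals(), number=10)

t_sparse = timeit(
    'sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m)).tocsc()', **kwds
)
t_argsort = timeit('np.argsort(idx)', **kwds)
t_stable = timeit('np.argsort(idx, kind="stable")', **kwds)

print(f"csr -> csc:       {t_sparse:.4f} s")
print(f"argsort default:  {t_argsort:.4f} s")
print(f"argsort stable:   {t_stable:.4f} s")
```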
    
