Can I perform dynamic cumsum of rows in pandas?

别跟我提以往 2020-12-01 22:49

If I have the following dataframe, derived like so: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))

    0
0   0
1   2
2   8
3   1
4            


        
3 Answers
  •  醉酒成梦
    2020-12-01 23:08

    A loop isn't necessarily bad. The trick is to run it over low-level objects, such as the underlying NumPy array, rather than over pandas objects. In this case, you can use Numba or Cython. For example, using a generator compiled with numba.njit:

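    import pandas as pd
    from numba import njit
    
    @njit
    def cumsum_limit(A, limit=5):
        # Yield (position, running total) whenever the total exceeds the limit,
        # then restart the running total from zero.
        count = 0
        for i in range(A.shape[0]):
            count += A[i]
            if count > limit:
                yield i, count
                count = 0
    
    # Pass the underlying NumPy array, not the Series, to the compiled function.
    idx, vals = zip(*cumsum_limit(df[0].values))
    res = pd.Series(vals, index=idx)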
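
    To make the output concrete, here is a minimal pure-Python sketch of the same loop, run on the ten values used in the timing setup. The name cumsum_limit_py and the empty-result guard are illustrative additions, not part of the answer's code; it takes a plain list, so Numba is not required to try it.

```python
def cumsum_limit_py(values, limit=5):
    """Yield (index, running_total) each time the running total exceeds limit."""
    count = 0
    for i, v in enumerate(values):
        count += v
        if count > limit:
            yield i, count
            count = 0  # restart the running total after each hit


sample = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]  # the ten values from the timing setup
hits = list(cumsum_limit_py(sample))
print(hits)  # [(2, 10), (6, 8)]

# zip(*hits) cannot be unpacked into two names when there are no hits,
# so fall back to two empty tuples in that case.
idx, vals = zip(*hits) if hits else ((), ())
```

    With pandas, the same function should accept df[0].tolist() or df[0].values unchanged, and the guard keeps the zip(*...) unpacking from failing when no running total exceeds the limit.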
    

    To demonstrate the performance benefits of JIT-compiling with Numba:

    import pandas as pd, numpy as np
    from numba import njit
    
    df = pd.DataFrame({0: [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]})
    
    @njit
    def cumsum_limit_nb(A, limit=5):
        count = 0
        for i in range(A.shape[0]):
            count += A[i]
            if count > limit:
                yield i, count
                count = 0
    
    def cumsum_limit(A, limit=5):
        count = 0
        for i in range(A.shape[0]):
            count += A[i]
            if count > limit:
                yield i, count
                count = 0
    
    n = 10**4
    df = pd.concat([df]*n, ignore_index=True)  # 100,000 rows
    
    list(cumsum_limit_nb(df[0].values))  # warm-up call so JIT compilation is not timed
    
    %timeit list(cumsum_limit_nb(df[0].values))  # 4.19 ms ± 90.4 µs per loop
    %timeit list(cumsum_limit(df[0].values))     # 58.3 ms ± 194 µs per loop
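
    %timeit is an IPython magic, so the two timing lines above only run in IPython or Jupyter. As a rough sketch of the same kind of measurement in a plain script, the standard-library timeit module can be used. This version times only the pure-Python generator on a list (an assumption made to keep it dependency-free), so its numbers are not comparable to the figures quoted above.

```python
import timeit


def cumsum_limit(values, limit=5):
    # Same loop as in the answer, written against any iterable of numbers.
    count = 0
    for i, v in enumerate(values):
        count += v
        if count > limit:
            yield i, count
            count = 0


# 100,000 values: the ten-value sample repeated 10**4 times.
data = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2] * 10**4

# Take the best of 3 repeats of 5 calls each, then divide by 5
# to get seconds per call.
best = min(timeit.repeat(lambda: list(cumsum_limit(data)), number=5, repeat=3)) / 5
print(f"{best * 1e3:.2f} ms per call")
```

    To time the Numba version the same way, the lambda would call the compiled function on df[0].values instead, after one untimed warm-up call.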
    
