Can I perform dynamic cumsum of rows in pandas?

别跟我提以往 asked on 2020-12-01 22:49 · backend · 3 answers · 1886 views

If I have the following dataframe, derived like so: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))

    0
0   0
1   2
2   8
3   1
..  ..

How can I compute a cumulative sum of column 0 that resets each time the running total exceeds a given limit (say 5), recording the index and the accumulated total at each reset?
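The answers below benchmark against a loop-based foo(df, 5) from the question; as a rough, hypothetical stand-in for what is being asked (not the asker's actual code), the behaviour can be written as a plain pandas/Python loop:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))

    def cumsum_with_reset(df, limit=5):
        # Hypothetical baseline: accumulate column 0 and, whenever the
        # running total exceeds `limit`, record (index, total) and restart.
        out = {}
        running = 0
        for idx, val in df[0].items():
            running += val
            if running > limit:
                out[idx] = running
                running = 0
        return pd.Series(out, dtype='int64')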
3 Answers
  • 2020-12-01 23:08

    A loop isn't necessarily bad. The trick is to make sure it's performed on low-level objects. In this case, you can use Numba or Cython. For example, using a generator with numba.njit:

    from numba import njit
    
    @njit
    def cumsum_limit(A, limit=5):
        # Running total over A: yield (position, total) and reset to zero
        # each time the total exceeds `limit`.
        count = 0
        for i in range(A.shape[0]):
            count += A[i]
            if count > limit:
                yield i, count
                count = 0
    
    idx, vals = zip(*cumsum_limit(df[0].values))
    res = pd.Series(vals, index=idx)
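
    For a column holding 0, 2, 8, 1, 0, 0, 7, 0, 2, 2 (the concrete frame used in the timing demo below) and the default limit=5, this would give:

    res
    2    10
    6     8
    dtype: int64

    Note that a trailing partial sum which never exceeds the limit is dropped entirely; the functions in the other answer keep it as a final row.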
    

    To demonstrate the performance benefits of JIT-compiling with Numba:

    import pandas as pd, numpy as np
    from numba import njit
    
    df = pd.DataFrame({0: [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]})
    
    @njit
    def cumsum_limit_nb(A, limit=5):
        count = 0
        for i in range(A.shape[0]):
            count += A[i]
            if count > limit:
                yield i, count
                count = 0
    
    def cumsum_limit(A, limit=5):
        count = 0
        for i in range(A.shape[0]):
            count += A[i]
            if count > limit:
                yield i, count
                count = 0
    
    n = 10**4
    df = pd.concat([df]*n, ignore_index=True)
    
    %timeit list(cumsum_limit_nb(df[0].values))  # 4.19 ms ± 90.4 µs per loop
    %timeit list(cumsum_limit(df[0].values))     # 58.3 ms ± 194 µs per loop
    
  • 2020-12-01 23:22

    The loop can't be avoided, but it can be compiled down to fast native code with numba's njit:

    from numba import njit, prange
    
    @njit
    def dynamic_cumsum(seq, index, max_value):
        cumsum = []
        running = 0
        for i in prange(len(seq)):  # without parallel=True, prange behaves like range
            # Once the running total has exceeded max_value, record the
            # index label and the accumulated value, then start over.
            if running > max_value:
                cumsum.append([index[i], running])
                running = 0
            running += seq[i]
        # Always keep the trailing (possibly partial) sum.
        cumsum.append([index[-1], running])

        return cumsum
    

    The index is required here to cover the general case, where your index is not numeric / monotonically increasing. (foo in the timings below is presumably the loop-based implementation from the question.)
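
    A minimal usage sketch (my own example, assuming the same 10-row data as in the other answer but re-indexed with arbitrary integer labels, so the reported positions come from the index rather than from 0..n-1):

    df2 = pd.DataFrame({0: [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]},
                       index=range(100, 110))  # non-default integer index

    lst = dynamic_cumsum(df2[0].values, df2.index.values, 5)
    pd.DataFrame(lst, columns=['A', 'B']).set_index('A')

          B
    A
    103  10
    107   8
    109   4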

    %timeit foo(df, 5)
    1.24 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit dynamic_cumsum(df.iloc(axis=1)[0].values, df.index.values, 5)
    77.2 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    If the index is of Int64Index type, you can shorten this to:

    @njit
    def dynamic_cumsum2(seq, max_value):
        cumsum = []
        running = 0
        for i in prange(len(seq)):
            if running > max_value:
                cumsum.append([i, running])
                running = 0
            running += seq[i] 
        cumsum.append([i, running])
    
        return cumsum
    
    lst = dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
    pd.DataFrame(lst, columns=['A', 'B']).set_index('A')
    
        B
    A    
    3  10
    7   8
    9   4
    

    %timeit foo(df, 5)
    1.23 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
    71.4 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    Performance of the two njit functions, compared with perfplot (cumsum_limit_nb is the generator from the other answer):

    import pandas as pd, numpy as np, perfplot

    perfplot.show(
        setup=lambda n: pd.DataFrame(np.random.randint(0, 10, size=(n, 1))),
        kernels=[
            lambda df: list(cumsum_limit_nb(df.iloc[:, 0].values, 5)),
            lambda df: dynamic_cumsum2(df.iloc[:, 0].values, 5)
        ],
        labels=['cumsum_limit_nb', 'dynamic_cumsum2'],
        n_range=[2**k for k in range(0, 17)],
        xlabel='N',
        logx=True,
        logy=True,
        equality_check=None # TODO - update when @jpp adds in the final `yield`
    )
    

    The log-log plot shows that the generator function is faster for larger inputs.

    A possible explanation is that, as N increases, the overhead of appending to a growing list in dynamic_cumsum2 becomes significant, while cumsum_limit_nb only has to yield its values.
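
    One way to test that explanation is to keep the njit loop but write the results into a preallocated NumPy array instead of appending to a list. This is only a sketch (dynamic_cumsum_prealloc is an illustrative name, not part of either answer), mirroring dynamic_cumsum2's output including the trailing row:

    import numpy as np
    from numba import njit

    @njit
    def dynamic_cumsum_prealloc(seq, max_value):
        # Same logic as dynamic_cumsum2, but rows are written into a
        # preallocated (n+1, 2) int64 array instead of a growing list.
        out = np.empty((seq.shape[0] + 1, 2), dtype=np.int64)
        k = 0
        running = 0
        for i in range(seq.shape[0]):
            if running > max_value:
                out[k, 0] = i
                out[k, 1] = running
                k += 1
                running = 0
            running += seq[i]
        out[k, 0] = seq.shape[0] - 1  # trailing (possibly partial) sum
        out[k, 1] = running
        return out[:k + 1]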

  • 2020-12-01 23:31

    A simpler approach, using a plain Python loop over the NumPy cumsum (no numba needed):

    def dynamic_cumsum(seq, limit):
        res = []
        cs = seq.cumsum()
        for i, e in enumerate(cs):
            if cs[i] > limit:
                # Record the reset point, then subtract the accumulated
                # value from all remaining cumulative sums.
                res.append([i, e])
                cs[i+1:] -= e
        # Keep the trailing partial sum unless the last element already
        # triggered a reset (and guard against an empty result).
        if res and res[-1][0] == i:
            return res
        res.append([i, e])
        return res
    

    Result:

    x = dynamic_cumsum(df[0].values, 5)
    x
    >> [[2, 10], [6, 8], [9, 4]]
    