Can I perform dynamic cumsum of rows in pandas?

别跟我提以往 2020-12-01 22:49

If I have the following dataframe, derived like so: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))

    0
0   0
1   2
2   8
3   1
4            


        
3 Answers
  •  情书的邮戳
    2020-12-01 23:22

    The loop cannot be avoided, but it can be made very fast by compiling it with numba's njit:

    from numba import njit, prange

    @njit
    def dynamic_cumsum(seq, index, max_value):
        # Accumulate values; whenever the running total exceeds max_value,
        # record [index label, running total] and reset the accumulator.
        cumsum = []
        running = 0
        # Without parallel=True, prange behaves like a plain range here.
        for i in prange(len(seq)):
            if running > max_value:
                cumsum.append([index[i], running])
                running = 0
            running += seq[i]
        cumsum.append([index[-1], running])  # flush whatever is left at the end

        return cumsum
    

    The index is required here, assuming your index is not numeric/monotonically increasing.
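
    As a quick illustration with made-up data, the index-aware version can be called on a Series whose numeric index is not monotonically increasing; the first column of the result then carries the original index labels (the Series below is purely hypothetical):

    import pandas as pd

    # hypothetical example data: numeric but non-monotonic index
    s = pd.Series([3, 4, 6, 2, 8, 1], index=[10, 7, 30, 2, 50, 4])
    lst = dynamic_cumsum(s.values, s.index.values, 5)
    pd.DataFrame(lst, columns=['A', 'B']).set_index('A')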

    %timeit foo(df, 5)
    1.24 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit dynamic_cumsum(df.iloc(axis=1)[0].values, df.index.values, 5)
    77.2 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
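
    foo is not defined in this excerpt; it presumably refers to the plain-Python baseline being compared against. Purely as an assumption, an equivalent uncompiled loop might look like this:

    def foo(df, max_value):
        # hypothetical pure-Python baseline: same reset-on-threshold logic,
        # iterating over the first column without numba
        cumsum = []
        running = 0
        for idx, val in zip(df.index, df.iloc[:, 0]):
            if running > max_value:
                cumsum.append([idx, running])
                running = 0
            running += val
        cumsum.append([df.index[-1], running])
        return cumsum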
    

    If the index is just the default 0, 1, 2, ... (so positions and labels coincide), you can drop the index argument and shorten this to:

    @njit
    def dynamic_cumsum2(seq, max_value):
        # Same logic as above, but the positional counter i doubles as the
        # label, so the index array does not need to be passed in.
        cumsum = []
        running = 0
        for i in prange(len(seq)):
            if running > max_value:
                cumsum.append([i, running])
                running = 0
            running += seq[i]
        cumsum.append([i, running])  # flush the remainder at the last position

        return cumsum
    
    lst = dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
    pd.DataFrame(lst, columns=['A', 'B']).set_index('A')
    
        B
    A    
    3  10
    7   8
    9   4
    

    %timeit foo(df, 5)
    1.23 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
    71.4 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    njit Functions Performance

    import perfplot

    perfplot.show(
        setup=lambda n: pd.DataFrame(np.random.randint(0, 10, size=(n, 1))),
        kernels=[
            # cumsum_limit_nb is the generator-based function from another answer
            lambda df: list(cumsum_limit_nb(df.iloc[:, 0].values, 5)),
            lambda df: dynamic_cumsum2(df.iloc[:, 0].values, 5)
        ],
        labels=['cumsum_limit_nb', 'dynamic_cumsum2'],
        n_range=[2**k for k in range(0, 17)],
        xlabel='N',
        logx=True,
        logy=True,
        equality_check=None  # TODO - update when @jpp adds in the final `yield`
    )
    

    The log-log plot shows that the generator function is faster for larger inputs:

    A possible explanation is that, as N increases, the overhead of appending to a growing list in dynamic_cumsum2 becomes prominent, while cumsum_limit_nb only has to yield each value.
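
    cumsum_limit_nb itself is not reproduced in this answer (it comes from another answer). As a rough sketch of what such a generator-based njit function might look like (an assumption, not the original code):

    from numba import njit

    @njit
    def cumsum_limit_nb(seq, max_value):
        # hypothetical sketch: yield (position, running total) whenever the
        # running sum exceeds max_value, then reset the accumulator
        running = 0
        for i in range(len(seq)):
            running += seq[i]
            if running > max_value:
                yield i, running
                running = 0

    It would be consumed as list(cumsum_limit_nb(...)), as in the perfplot kernels above.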
