Rolling grouped cumulative sum

Posted by 随声附和 on 2019-12-04 09:14:58

This is a very interesting problem. Try the approach below and see if it works for you.
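
For reference, here is a sample input that reproduces the outputs shown below. It is reconstructed from the pivot table in the last answer, so the exact row layout is an assumption (the real data may spread a (Per, C) total over several rows):

import pandas as pd

# One row per (Per, C) total, taken from the pivot table further down;
# the original data may have had several rows per pair.
df = pd.DataFrame({
    'Per': [1, 1, 2, 2, 3, 4, 4, 5, 6, 6],
    'C':   ['a', 'c', 'a', 'b', 'j', 'a', 'x', 'a', 'c', 'k'],
    'V':   [4, 4, 6, 5, 7, 9, 11, 2, 3, 6],
})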

(
    pd.concat([df.loc[df.Per<=i][['C','V']].assign(Per=i) for i in df.Per.unique()])
    .groupby(by=['Per','C'])
    .sum()
    .reset_index()
)

Out[197]: 
    Per  C   V
0     1  a   4
1     1  c   4
2     2  a  10
3     2  b   5
4     2  c   4
5     3  a  10
6     3  b   5
7     3  c   4
8     3  j   7
9     4  a  19
10    4  b   5
11    4  c   4
12    4  j   7
13    4  x  11
14    5  a  21
15    5  b   5
16    5  c   4
17    5  j   7
18    5  x  11
19    6  a  21
20    6  b   5
21    6  c   7
22    6  j   7
23    6  k   6
24    6  x  11

If you set the index to 'Per' and 'C', you can first sum over those index levels. Then reindex the resulting series with the Cartesian product of the index levels to get every (Per, C) combination, filling the new entries with zero.

After that, group by 'C', take the cumulative sum, and drop the zeros.

s = df.set_index(['Per', 'C']).V.sum(level=[0, 1])

s.reindex(
    pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
    fill_value=0
).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

    Per  C   V
0     1  a   4
1     1  c   4
2     2  a  10
3     2  b   5
4     2  c   4
5     3  a  10
6     3  b   5
7     3  c   4
8     3  j   7
9     4  a  19
10    4  b   5
11    4  c   4
12    4  j   7
13    4  x  11
14    5  a  21
15    5  b   5
16    5  c   4
17    5  j   7
18    5  x  11
19    6  a  21
20    6  b   5
21    6  c   7
22    6  j   7
23    6  k   6
24    6  x  11
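
Note: Series.sum(level=...) has since been deprecated and removed in pandas 2.0, so on newer versions the first line can be replaced by a plain groupby sum, which is the same change made in using_reindex2 in the benchmark below:

# Equivalent first line on newer pandas versions
s = df.groupby(['Per', 'C'])['V'].sum()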

You could use pivot_table/cumsum:

(df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
   .fillna(0) 
   .cumsum(axis=0)
   .replace(0, np.nan)
   .stack().reset_index())

yields

    Per  C     0
0     1  a   4.0
1     1  c   4.0
2     2  a  10.0
3     2  b   5.0
4     2  c   4.0
5     3  a  10.0
6     3  b   5.0
7     3  c   4.0
8     3  j   7.0
9     4  a  19.0
10    4  b   5.0
11    4  c   4.0
12    4  j   7.0
13    4  x  11.0
14    5  a  21.0
15    5  b   5.0
16    5  c   4.0
17    5  j   7.0
18    5  x  11.0
19    6  a  21.0
20    6  b   5.0
21    6  c   7.0
22    6  j   7.0
23    6  k   6.0
24    6  x  11.0

On the plus side, I think the pivot_table/cumsum approach helps convey the meaning of the calculation pretty well. Given the pivot_table, the calculation is essentially a cumulative sum down each column:

In [131]: df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
Out[131]: 
C      a    b    c    j    k     x
Per                               
1    4.0  NaN  4.0  NaN  NaN   NaN
2    6.0  5.0  NaN  NaN  NaN   NaN
3    NaN  NaN  NaN  7.0  NaN   NaN
4    9.0  NaN  NaN  NaN  NaN  11.0
5    2.0  NaN  NaN  NaN  NaN   NaN
6    NaN  NaN  3.0  NaN  6.0   NaN
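
After .fillna(0).cumsum(axis=0), that table becomes (the values follow directly from the table above; the zeros are the placeholder cells that replace(0, np.nan) later turns back into NaN so that stack() drops them):

C       a    b    c    j    k     x
Per
1     4.0  0.0  4.0  0.0  0.0   0.0
2    10.0  5.0  4.0  0.0  0.0   0.0
3    10.0  5.0  4.0  7.0  0.0   0.0
4    19.0  5.0  4.0  7.0  0.0  11.0
5    21.0  5.0  4.0  7.0  0.0  11.0
6    21.0  5.0  7.0  7.0  6.0  11.0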

On the negative side, the need to fuss with 0's and NaNs is not ideal. We need 0's for the cumsum, but we need NaNs to make the unwanted rows disappear when the DataFrame is stacked.
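
If genuine zero totals were possible in the data, replace(0, np.nan) would drop those rows too. A sketch of one way around that, masking on "category seen so far" instead of using zero as a sentinel (assumes the same df with columns Per, C, V as above):

# Build the pivot table, then keep only cells from the first period each category appears in.
pivot = df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
seen = pivot.notna().cummax()        # True from the first period a category appears
result = (pivot.fillna(0).cumsum()
               .where(seen)          # NaN out the not-yet-seen cells
               .stack()
               .reset_index(name='V'))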

The pivot_table/cumsum approach also offers a considerable speed advantage over using_concat, but piRSquared's solution is the fastest. On a 1000-row df:

In [169]: %timeit using_reindex2(df)
100 loops, best of 3: 6.86 ms per loop

In [152]: %timeit using_reindex(df)
100 loops, best of 3: 8.36 ms per loop

In [80]: %timeit using_pivot(df)
100 loops, best of 3: 8.58 ms per loop

In [79]: %timeit using_concat(df)
10 loops, best of 3: 84 ms per loop

Here is the setup I used for the benchmark:

import numpy as np
import pandas as pd

def using_pivot(df):
    return (df.pivot_table(index='P', columns='C', values='V', aggfunc='sum')
             .fillna(0)
             .cumsum(axis=0)
             .replace(0, np.nan)
             .stack().reset_index())

def using_reindex(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    """
    s = df.set_index(['P', 'C']).V.sum(level=[0, 1])

    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_reindex2(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    with first line changed
    """
    s = df.groupby(['P', 'C'])['V'].sum()
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_concat(df):
    """
    https://stackoverflow.com/a/49095139/190597 (Allen)
    """
    return (pd.concat([df.loc[df.P<=i][['C','V']].assign(P=i) 
                       for i in df.P.unique()])
              .groupby(by=['P','C'])
              .sum()
              .reset_index())

def make(nrows):
    df = pd.DataFrame(np.random.randint(50, size=(nrows,3)), columns=list('PCV'))
    return df

df = make(1000)