python pandas conditional cumulative sum

前端 未结 3 1079
一个人的身影
一个人的身影 2020-12-15 10:49

Consider my dataframe df

data  data_binary  sum_data
  2       1            1
  5       0            0
  1       1            1
  4       1              


        
相关标签:
3条回答
  • 2020-12-15 11:05

    I think you can groupby with DataFrameGroupBy.cumsum by Series, where first compare next value by shifted column if not equal (!=) and then create groups by cumsum. Last replace 0 by column data_binary with mask:

    print (df.data_binary.ne(df.data_binary.shift()).cumsum())
    0    1
    1    2
    2    3
    3    3
    4    3
    5    4
    6    4
    7    5
    Name: data_binary, dtype: int32
    
    df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum())
                                    .cumsum()
    df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
    print (df)
       data  data_binary  sum_data  sum_data1
    0     2            1         1          1
    1     5            0         0          0
    2     1            1         1          1
    3     4            1         2          2
    4     3            1         3          3
    5    10            0         0          0
    6     7            0         0          0
    7     3            1         1          1
    
    0 讨论(0)
  • 2020-12-15 11:07

    you want to take the cumulative sum of data_binary and subtract the most recent cumulative sum where data_binary was zero.

    b = df.data_binary
    c = b.cumsum()
    c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
    
    0    1
    1    0
    2    1
    3    2
    4    3
    5    0
    6    0
    7    1
    Name: data_binary, dtype: int64
    

    Explanation

    Let's start by looking at each step side by side

    cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
    print(pd.concat([
            b, c,
            c.mask(b != 0),
            c.mask(b != 0).ffill(),
            c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
        ], axis=1, keys=cols))
    
    
       data_binary  cumulative_sum  nan_non_zero  forward_fill  final_result
    0            1               1           NaN           NaN             1
    1            0               1           1.0           1.0             0
    2            1               2           NaN           1.0             1
    3            1               3           NaN           1.0             2
    4            1               4           NaN           1.0             3
    5            0               4           4.0           4.0             0
    6            0               4           4.0           4.0             0
    7            1               5           NaN           4.0             1
    

    The problem with cumulative_sum is that the rows where data_binary is zero, do not reset the sum. And that is the motivation for this solution. How do we "reset" the sum when data_binary is zero? Easy! I slice the cumulative sum where data_binary is zero and forward fill the values. When I take the difference between this and the cumulative sum, I've effectively reset the sum.

    0 讨论(0)
  • 2020-12-15 11:08

    If you want the excellent piRSquared's answer in just one single command:

    df['sum_data'] = df[['data_binary']].apply(
        lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int), 
        axis=0)
    

    Note that the double squared bracket on the right hand side is necessary to make a one-column DataFrame instead of a Series in order to use apply with the axis argument (which is not available when apply is used on Series).

    0 讨论(0)
提交回复
热议问题