python pandas conditional cumulative sum

前端未结

关注

 3  1079

Consider my dataframe df

data  data_binary  sum_data
  2       1            1
  5       0            0
  1       1            1
  4       1


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  暖寄归人        
                
              
                            
                2020-12-15 11:05
              
            
            
                                                                       
I think you can groupby with DataFrameGroupBy.cumsum by Series, where first compare next value by shifted column if not equal (!=) and then create groups by cumsum. Last replace 0 by column data_binary with mask:

print (df.data_binary.ne(df.data_binary.shift()).cumsum())
0    1
1    2
2    3
3    3
4    3
5    4
6    4
7    5
Name: data_binary, dtype: int32

df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum())
                                .cumsum()
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print (df)
   data  data_binary  sum_data  sum_data1
0     2            1         1          1
1     5            0         0          0
2     1            1         1          1
3     4            1         2          2
4     3            1         3          3
5    10            0         0          0
6     7            0         0          0
7     3            1         1          1

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  不思量自难忘°        
                
              
                            
                2020-12-15 11:07
              
            
            
                                                                       
you want to take the cumulative sum of data_binary and subtract the most recent cumulative sum where data_binary was zero.

b = df.data_binary
c = b.cumsum()
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)

0    1
1    0
2    1
3    2
4    3
5    0
6    0
7    1
Name: data_binary, dtype: int64




Explanation  

Let's start by looking at each step side by side

cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
        b, c,
        c.mask(b != 0),
        c.mask(b != 0).ffill(),
        c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
    ], axis=1, keys=cols))


   data_binary  cumulative_sum  nan_non_zero  forward_fill  final_result
0            1               1           NaN           NaN             1
1            0               1           1.0           1.0             0
2            1               2           NaN           1.0             1
3            1               3           NaN           1.0             2
4            1               4           NaN           1.0             3
5            0               4           4.0           4.0             0
6            0               4           4.0           4.0             0
7            1               5           NaN           4.0             1


The problem with cumulative_sum is that the rows where data_binary is zero, do not reset the sum.  And that is the motivation for this solution.  How do we "reset" the sum when data_binary is zero?  Easy!  I slice the cumulative sum where data_binary is zero and forward fill the values.  When I take the difference between this and the cumulative sum, I've effectively reset the sum.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  醉话见心        
                
              
                            
                2020-12-15 11:08
              
            
            
                                                                       
If you want the excellent piRSquared's answer in just one single command:

df['sum_data'] = df[['data_binary']].apply(
    lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int), 
    axis=0)


Note that the double squared bracket on the right hand side is necessary to make a one-column DataFrame instead of a Series in order to use apply with the axis argument (which is not available when apply is used on Series).  
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复