Sum a range of cells in a single column in pandas dataframe

问题

I have three columns in a DataFrame. I want to take the number in the Streak_Count column and sum up that number of cells from the returns in the MON TOTAL. The result is displayed in the WANTED RESULT as shown below. The issue I cant figure out is summing the number of cells which can be any number>> in this example between 1 and 4.

              MON TOTAL STREAK_COUNT    WANTED RESULT
1/2/1992       1.123077       1          1.123077 (only 1 so 1.12)
2/3/1992      -1.296718       0 
3/2/1992      -6.355612       2          -7.65233 (sum of -1.29 and -6.35)
4/1/1992       5.634692       0 
5/1/1992       4.180605       2          9.815297 (sum of 5.63 and 4.18)
7/1/1992      -0.101016       0 
8/3/1992      -0.706125       2         -0.807141 (sum of -.10 and -.706)
10/1/1992      0.368579       0 
11/2/1992      3.822277       0 
1/4/1993       2.233359       0 
2/1/1993       15.219644      4         21.643859
3/1/1993       -2.647693      1         -2.647693
4/1/1993       1.599094       1         1.599094

回答1:

It's all about finding the right thing to group by. In this case, a reversed cumulative sum of STREAK_COUNT will give you what you want.

First we create the dataframe:

import pandas as pd

>>> df = pd.DataFrame({'MON TOTAL':[1.123077, -1.296178, -6.355612, 5.634692, 4.180605, -0.101016, -0.706125,
                                    0.368579, 3.822277, 2.233359, 15.219644, -2.647693, 1.599094],
                       'STREAK_COUNT':[1, 0, 2, 0, 2, 0, 2, 0, 0, 0, 4, 1, 1]},
                      index=['1/2/1992', '2/3/1992', '3/2/1992', '4/1/1992', '5/1/1992', '7/1/1992', '8/3/1992',
                             '10/1/1992', '11/2/1992', '1/4/1993', '2/1/1993', '3/1/1993', '4/1/1993'])
>>> df
           MON TOTAL  STREAK_COUNT
1/2/1992    1.123077             1
2/3/1992   -1.296178             0
3/2/1992   -6.355612             2
4/1/1992    5.634692             0
5/1/1992    4.180605             2
7/1/1992   -0.101016             0
8/3/1992   -0.706125             2
10/1/1992   0.368579             0
11/2/1992   3.822277             0
1/4/1993    2.233359             0
2/1/1993   15.219644             4
3/1/1993   -2.647693             1
4/1/1993    1.599094             1

Next find the groups, compute the sum of each group, and join the results to the original dataframe:

>>> groups = df['STREAK_COUNT'][::-1].cumsum()[::-1]
>>> df['RESULT'] = df.groupby(groups)['MON TOTAL'].transform('sum')
>>> df
           MON TOTAL  STREAK_COUNT     RESULT
1/2/1992    1.123077             1   1.123077
2/3/1992   -1.296178             0  -7.651790
3/2/1992   -6.355612             2  -7.651790
4/1/1992    5.634692             0   9.815297
5/1/1992    4.180605             2   9.815297
7/1/1992   -0.101016             0  -0.807141
8/3/1992   -0.706125             2  -0.807141
10/1/1992   0.368579             0  21.643859
11/2/1992   3.822277             0  21.643859
1/4/1993    2.233359             0  21.643859
2/1/1993   15.219644             4  21.643859
3/1/1993   -2.647693             1  -2.647693
4/1/1993    1.599094             1   1.599094

If you just want results for the end of each streak, then use a mask to filter it:

>>> df[df['STREAK_COUNT'] > 0]
          MON TOTAL  STREAK_COUNT     RESULT
1/2/1992   1.123077             1   1.123077
3/2/1992  -6.355612             2  -7.651790
5/1/1992   4.180605             2   9.815297
8/3/1992  -0.706125             2  -0.807141
2/1/1993  15.219644             4  21.643859
3/1/1993  -2.647693             1  -2.647693
4/1/1993   1.599094             1   1.599094

回答2:

EDIT 1:

Efficient version that performs operations on entire column rather than iterating through and utilizes object/iterator thus avoiding lists.

The changes here are that,

1) We first get Series for "STREAK_COUNT" and convert to tuple with index as first value and "STREAK_COUNT" as second using enumerate. Since enumerate is an iterable, we can use it directly in step(3) instead of converting to list.

2) Define getTotal() function and process to each value in the (1) independently to generate corresponding "RESULT" value.

3) map getTotal() to entire "STREAK_COUNT" enumerate object generated in step(1) to get desired "RESULT" column.

Working Code

def getTotal(x):        
        pos = [i for i in range(x[0],x[0]-x[1],-1)]
        total = sum(df.iloc[pos, 0])
        func = lambda x : x if x !=0 else ''
        return func(total)


df[ "RESULT" ] = map(lambda x: getTotal(x), enumerate(df["STREAK_COUNT"]))
print df

Older Version

Not as efficient since we need to iterate through rows and columns, but works for any values within STREAK_COUNT and more clear to understand.

Sample Code

import pandas as pd

df = pd.read_csv("sample.csv", index_col = 0)

#iterate over rows with index
for idx, (lab, row) in  enumerate(df.iterrows()):
        #get current STREAK_COUNT value
    count = int(df.iloc[idx, 1])

    #get previous positions based on count
    pos = [i for i in range(idx,idx-count,-1)]

    #count total
    total = sum(df.iloc[pos, 0])

    #create new column value based on total
    func = lambda x : x if x !=0 else ''
    df.loc[lab, "RESULT"] = func(total)

print df

RESULT

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>> 
           MON TOTAL  STREAK_COUNT    RESULT
1/2/1992    1.123077             1   1.12308
2/3/1992   -1.296718             0          
3/2/1992   -6.355612             2  -7.65233
4/1/1992    5.634692             0          
5/1/1992    4.180605             2    9.8153
7/1/1992   -0.101016             0          
8/3/1992   -0.706125             2 -0.807141
10/1/1992   0.368579             0          
11/2/1992   3.822277             0          
1/4/1993    2.233359             0          
2/1/1993   15.219644             4   21.6439
3/1/1993   -2.647693             1  -2.64769
4/1/1993    1.599094             1   1.59909
>>>

来源：https://stackoverflow.com/questions/48133983/sum-a-range-of-cells-in-a-single-column-in-pandas-dataframe

标签

python

pandas

sum