Add new columns to a pandas df after filtering

Question

I have a df that contains information about various places.

import pandas as pd

d = ({
    'C' : ['08:00:00','XX','08:10:00','XX','08:41:42','XX','08:50:00','XX', '09:00:00', 'XX','09:15:00','XX','09:21:00','XX','09:30:00','XX','09:40:00','XX'],
    'D' : ['Home','','Home','','Away','','Home','','Away','','Home','','Home','','Away','','Home',''],
    'E' : ['Num:','','Num:','','Num:','','Num:','','Num:', '','Num:','','Num:','','Num:', '','Num:', ''],
    'F' : ['1','','1','','1','','1','','1', '','2','','2','','1', '','2',''],   
    'A' : ['A','','A','','A','','A','','A','','A','','A','','A','','A',''],           
    'B' : ['Stop','','Res','','Stop','','Start','','Res','','Stop','','Res','','Start','','Start','']
    })

df = pd.DataFrame(data=d)

I want to export the data by place, as labelled in column D. I also want to add new columns computed from the events labelled in column B.

df['C'] = pd.to_timedelta(df['C'], errors="coerce").dt.total_seconds()

incl = ['Home', 'Away']    

for k, g in df[df.D.isin(incl)].groupby('D'):
    Stop = g.loc[df['B'] == 'Stop'].reset_index()['C']
    Start = g.loc[df['B'] == 'Start'].reset_index()['C']
    Res = g.loc[df['B'] == 'Res'].reset_index()['C']

    g['Start_diff'] = Start - Stop
    g['Res_diff'] = Start - Res

The problem is that these events occur multiple times, and each occurrence is numbered in column F. In the export for Home, the diff is only computed for the first occurrence in column F.

Output:

    A   B       C       D       E       F   Start_diff  Res_diff
0   A   Stop    28800   Home    Num:    1   3000        2400
2   A   Res     29400   Home    Num:    1       
6   A   Start   31800   Home    Num:    1       
10  A   Stop    33300   Home    Num:    2       
12  A   Res     33660   Home    Num:    2       
16  A   Start   34800   Home    Num:    2

Whereas the intended output is:

    A   B       C       D       E       F   Start_diff  Res_diff
0   A   Stop    28800   Home    Num:    1   3000        2400
2   A   Res     29400   Home    Num:    1       
6   A   Start   31800   Home    Num:    1       
10  A   Stop    33300   Home    Num:    2   1500        1140
12  A   Res     33660   Home    Num:    2       
16  A   Start   34800   Home    Num:    2
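The gap between the two tables appears to come from index alignment rather than from the grouping itself. After reset_index(), the Start and Stop series are labelled 0 and 1, and assigning their difference back to the group matches on labels, so only the row labelled 0 receives a value. A minimal sketch, using a hand-built stand-in for the Home group (the frame below is an assumption that copies the Home values from the data above):

```python
import pandas as pd

# Hypothetical stand-in for the Home group; the original row labels are kept
g = pd.DataFrame(
    {"B": ["Stop", "Res", "Start", "Stop", "Res", "Start"],
     "C": [28800.0, 29400.0, 31800.0, 33300.0, 33660.0, 34800.0]},
    index=[0, 2, 6, 10, 12, 16],
)

# reset_index() relabels the selected rows as 0, 1
stop = g.loc[g["B"] == "Stop"].reset_index()["C"]
start = g.loc[g["B"] == "Start"].reset_index()["C"]
diff = start - stop  # two values, labelled 0 and 1

# Column assignment aligns on labels: g has a row labelled 0 but none labelled 1,
# so the second difference is silently dropped
g["Start_diff"] = diff
print(g)
```

Both differences are computed; the second one is lost only at assignment time, which is why grouping by F as well (or assigning by position) is needed.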

I tried changing the loop header from for k, g in df[df.D.isin(incl)].groupby('D'): to for k, g in df[df.D.isin(incl)].groupby('D').F.nunique():

But that raises TypeError: 'numpy.int64' object is not iterable
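For context, the error arises because nunique() reduces each group to a single count, so the loop iterates over plain integers and cannot unpack them into k and g. A small sketch on a made-up frame (toy and its values are assumptions for illustration; the exact wording of the message can differ between pandas versions):

```python
import pandas as pd

# Hypothetical miniature frame with the same two columns
toy = pd.DataFrame({"D": ["Home", "Home", "Away"], "F": ["1", "2", "1"]})

# One integer per place: the number of distinct F values
counts = toy.groupby("D").F.nunique()

# Iterating a Series yields its values, so "k, g" has nothing to unpack
raised = False
try:
    for k, g in counts:
        pass
except TypeError:
    raised = True

# Grouping by both columns yields (key, sub-frame) pairs instead
group_keys = [key for key, _ in toy.groupby(["D", "F"])]
print(raised, group_keys)
```

Iterating over groupby(['D', 'F']) directly gives one sub-frame per place and occurrence, which is the unit the differences need to be computed on.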

Answer 1:

I believe you need a custom function applied with groupby on columns D and F, followed by mask to replace the duplicated values with NaN:

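def f(g):
    # Work on a copy so the group passed in by groupby is not modified in place
    g = g.copy()
    # Select each event's time from the group itself rather than the global df
    Stop = g.loc[g['B'] == 'Stop', 'C']
    Start = g.loc[g['B'] == 'Start', 'C']
    Res = g.loc[g['B'] == 'Res', 'C']
    # Each (D, F) group holds one Stop, one Res and one Start, so take the first value
    g['Start_diff'] = Start.values[0] - Stop.values[0]
    g['Res_diff'] = Start.values[0] - Res.values[0]
    return g

# group_keys=False keeps the original row labels instead of prepending D and F
df = df[df.D.isin(incl)].groupby(['D', 'F'], group_keys=False).apply(f)

# Keep the differences only on the first row of each (D, F) group
df[['Start_diff', 'Res_diff']] = df[['Start_diff', 'Res_diff']].mask(df.duplicated(['D','F']))
print (df)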
          C     D     E  F  A      B  Start_diff  Res_diff
0   28800.0  Home  Num:  1  A   Stop      3000.0    2400.0
2   29400.0  Home  Num:  1  A    Res         NaN       NaN
4   31302.0  Away  Num:  1  A   Stop      2898.0    1800.0
6   31800.0  Home  Num:  1  A  Start         NaN       NaN
8   32400.0  Away  Num:  1  A    Res         NaN       NaN
10  33300.0  Home  Num:  2  A   Stop      1500.0    1140.0
12  33660.0  Home  Num:  2  A    Res         NaN       NaN
14  34200.0  Away  Num:  1  A  Start         NaN       NaN
16  34800.0  Home  Num:  2  A  Start         NaN       NaN
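
How groupby().apply() treats the grouping columns and group keys depends on the pandas version, so an apply-free variant may be more robust. The following is a sketch, not the approach above: it pivots the events into one row per (D, F) pair, subtracts the columns, and joins the result back. The events frame is an assumption that reproduces the already-filtered Home/Away rows, and the names events, wide and result are made up for illustration.

```python
import pandas as pd

# Hypothetical, already-filtered frame holding the same Home/Away rows as above
events = pd.DataFrame({
    "C": [28800.0, 29400.0, 31302.0, 31800.0, 32400.0,
          33300.0, 33660.0, 34200.0, 34800.0],
    "D": ["Home", "Home", "Away", "Home", "Away", "Home", "Home", "Away", "Home"],
    "F": ["1", "1", "1", "1", "1", "2", "2", "1", "2"],
    "B": ["Stop", "Res", "Stop", "Start", "Res", "Stop", "Res", "Start", "Start"],
}, index=[0, 2, 4, 6, 8, 10, 12, 14, 16])

# One row per (D, F) pair, one column per event type (Res, Start, Stop)
wide = events.pivot_table(index=["D", "F"], columns="B", values="C", aggfunc="first")
wide["Start_diff"] = wide["Start"] - wide["Stop"]
wide["Res_diff"] = wide["Start"] - wide["Res"]

# Join the differences back on (D, F); the original row labels are preserved
result = events.join(wide[["Start_diff", "Res_diff"]], on=["D", "F"])

# Blank the differences everywhere except the first row of each (D, F) pair
repeats = events.duplicated(["D", "F"])
result.loc[repeats, ["Start_diff", "Res_diff"]] = float("nan")
print(result)
```

This assumes each (D, F) pair contains at most one Stop, Res and Start; with repeated events, aggfunc="first" would silently keep only the earliest one.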

Source: https://stackoverflow.com/questions/50831061/add-new-columns-to-a-pandas-df-after-filtering

Tags

python

pandas

dataframe

group-by

unique