Add unique groups to DF for each row including sum from other columns

Question

I have a DataFrame that looks like this:

ID     field_1     area_1    field_2       area_2    field_3     area_3    field_4      area_4
1      scoccer     500       basketball    200       swimming    100       basketball   50
2      volleyball  100       np.nan        np.nan    np.nan      np.nan    np.nan       np.nan
3      basketball  1000      football      10        np.nan      np.nan    np.nan       np.nan
4      swimming    280       swimming      200       basketball  320       np.nan       np.nan
5      volleyball  110       football      160       volleyball  30        np.nan       np.nan

The original DataFrame has the same structure, but its columns run from field_1 to field_30 and from area_1 to area_30.

I'd like to add columns to the DataFrame that hold, for each row, the distinct values found in the 'field_x' columns, each with its corresponding areas summed up. The added columns should look like this:

ID   group_1     area_1     group_2     area_2     group_3    area_3
1    scoccer     500        basketball  250        swimming   100
2    volleyball  100 
3    basketball  1000       football    10
4    swimming    480        basketball  320         
5    volleyball  140        football    160

Is there an easy way to realize this?

Answer 1:

Use pd.wide_to_long to reshape the DataFrame into long format, where you can group by ID and field and sum the areas. Then number the groups within each ID with cumcount and use pivot_table to get back to the wide format.

import pandas as pd

# Reshape to long format, then sum the areas per (ID, field) pair.
df = (pd.wide_to_long(df, i='ID', j='num', stubnames=['field', 'area'], sep='_')
        .groupby(['ID', 'field'])['area'].sum()
        .reset_index())
#   ID       field    area
#0   1  basketball   250.0
#1   1     scoccer   500.0
#2   1    swimming   100.0
#3   2  volleyball   100.0
#4   3  basketball  1000.0
#5   3    football    10.0
#6   4  basketball   320.0
#7   4    swimming   480.0
#8   5    football   160.0
#9   5  volleyball   140.0

df['idx'] = df.groupby('ID').cumcount()+1
df = (pd.pivot_table(df, index='ID', columns='idx', values=['field', 'area'], 
                     aggfunc='first')
        .sort_index(axis=1, level=1))
df.columns = ['_'.join(map(str, tup)) for tup in df.columns]

    area_1     field_1  area_2     field_2  area_3   field_3
ID                                                          
1    250.0  basketball   500.0     scoccer   100.0  swimming
2    100.0  volleyball     NaN         NaN     NaN       NaN
3   1000.0  basketball    10.0    football     NaN       NaN
4    320.0  basketball   480.0    swimming     NaN       NaN
5    160.0    football   140.0  volleyball     NaN       NaN
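
Note that the result above lists each ID's groups alphabetically, because groupby sorts its keys, whereas the desired output in the question lists them in order of first appearance. The following is a minimal end-to-end sketch of one way to keep that order, assuming the sample data from the question with real NaN values in place of the np.nan placeholders. The names long_df, sums and wide, and the rename of field to group, are choices made for this illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question (assumed to hold real NaN values).
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'field_1': ['scoccer', 'volleyball', 'basketball', 'swimming', 'volleyball'],
    'area_1': [500, 100, 1000, 280, 110],
    'field_2': ['basketball', np.nan, 'football', 'swimming', 'football'],
    'area_2': [200, np.nan, 10, 200, 160],
    'field_3': ['swimming', np.nan, np.nan, 'basketball', 'volleyball'],
    'area_3': [100, np.nan, np.nan, 320, 30],
    'field_4': ['basketball', np.nan, np.nan, np.nan, np.nan],
    'area_4': [50, np.nan, np.nan, np.nan, np.nan],
})

# Long format, ordered by ID and by the original column number.
long_df = (pd.wide_to_long(df, i='ID', j='num',
                           stubnames=['field', 'area'], sep='_')
             .reset_index()
             .sort_values(['ID', 'num']))

# sort=False keeps groups in order of first appearance within each ID;
# rows whose field is NaN are dropped by groupby by default.
sums = (long_df.groupby(['ID', 'field'], sort=False)['area'].sum()
               .reset_index()
               .rename(columns={'field': 'group'}))

# Number the groups within each ID and pivot back to wide format.
sums['idx'] = sums.groupby('ID').cumcount() + 1
wide = (pd.pivot_table(sums, index='ID', columns='idx',
                       values=['group', 'area'], aggfunc='first')
          .sort_index(axis=1, level=1))
wide.columns = ['_'.join(map(str, tup)) for tup in wide.columns]

print(wide)
```

For the real data with 30 field/area pairs, the same code should apply unchanged, since wide_to_long picks up every numeric suffix after the stub names.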

Just for fun, you could use the undocumented pd.lreshape instead of wide_to_long.

# Apply this to the original wide DataFrame, not the reshaped result above.
# Change range to (1,31) for your real data.
pd.lreshape(df, {'area': [f'area_{i}' for i in range(1,5)],
                 'field': [f'field_{i}' for i in range(1,5)]})

#    ID    area       field
#0    1   500.0     scoccer
#1    2   100.0  volleyball
#2    3  1000.0  basketball
#3    4   280.0    swimming
#4    5   110.0  volleyball
#5    1   200.0  basketball
#....
#10   4   320.0  basketball
#11   5    30.0  volleyball
#12   1    50.0  basketball
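
As a rough sketch of how the lreshape route could be carried through to the per-group sums, assuming the sample data from the question with real NaN values: the names wide_df, long_df and sums are illustrative only. By default lreshape drops rows where any of the reshaped columns is missing, so no separate dropna step is used here.

```python
import numpy as np
import pandas as pd

# Sample data from the question (assumed to hold real NaN values).
wide_df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'field_1': ['scoccer', 'volleyball', 'basketball', 'swimming', 'volleyball'],
    'area_1': [500, 100, 1000, 280, 110],
    'field_2': ['basketball', np.nan, 'football', 'swimming', 'football'],
    'area_2': [200, np.nan, 10, 200, 160],
    'field_3': ['swimming', np.nan, np.nan, 'basketball', 'volleyball'],
    'area_3': [100, np.nan, np.nan, 320, 30],
    'field_4': ['basketball', np.nan, np.nan, np.nan, np.nan],
    'area_4': [50, np.nan, np.nan, np.nan, np.nan],
})

# Stack area_1..area_4 into 'area' and field_1..field_4 into 'field'.
long_df = pd.lreshape(wide_df, {
    'area': [f'area_{i}' for i in range(1, 5)],
    'field': [f'field_{i}' for i in range(1, 5)],
})

# Sum the areas per (ID, field) pair, as in the wide_to_long version.
sums = long_df.groupby(['ID', 'field'])['area'].sum()
print(sums)
```

From here the cumcount and pivot_table steps shown earlier can be applied to sums.reset_index() in the same way.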

Source: https://stackoverflow.com/questions/62717796/add-unique-groups-to-df-for-each-row-including-sum-from-other-columns

Tags

python

pandas

dataframe

group-by

sum