Include missing group keys as NaN in pandas GroupBy output

问题

I have a dataframe in pandas.

test_df = pd.DataFrame({'date': ['2018-12-28', '2018-12-28', '2018-12-29', '2018-12-29', '2018-12-30', '2018-12-30'],
                       'transaction': ['aa', 'bb', 'cc', 'aa', 'bb', 'bb'],
                       'ccy': ['USD', 'EUR', 'EUR', 'USD', 'USD', 'USD'],
                       'amt': np.random.random(6)})

test_df:

date         transaction  ccy       amt
2018-12-28   aa           USD  0.323439
2018-12-28   bb           EUR  0.048948
2018-12-29   cc           EUR  0.793263
2018-12-29   aa           USD  0.013865
2018-12-30   bb           USD  0.658571
2018-12-30   bb           USD  0.224951

The following code is giving me this output.

grouper = test_df.groupby([pd.Grouper('date'), 'transaction', 'ccy'])
grp_transactions = grouper['amt'].sum().unstack()

output:

ccy                          EUR       USD
date       transaction                    
2018-12-28 aa                NaN  0.323439
           bb           0.048948       NaN
2018-12-29 aa                NaN  0.013865
           cc           0.793263       NaN
2018-12-30 bb                NaN  0.883523

I believe this is expected as the groupby function will group values in the columns based on the order above, sum accordingly, and not create new rows for transactions that are not in the DF.

Is there a way in pandas to include NaN values if a transaction is not done on a particular day when using groupby? ie. Output should be NaN for both ccy if my DF does not have transaction: cc on 28/12/2018.

Expected output:

ccy                          EUR       USD
date       transaction                    
2018-12-28 aa                NaN  0.323439
           bb           0.048948       NaN
           cc                NaN       NaN
2018-12-29 aa                NaN  0.013865
           bb                NaN       NaN
           cc           0.793263       NaN
2018-12-30 aa                NaN       NaN
           bb                NaN  0.883523
           cc                NaN       NaN

Any help would be appreciated. Thanks!

回答1:

This is easy if you convert "transaction" to a categorical column before grouping,

df.transaction = pd.Categorical(df.transaction)
df.groupby(['date', 'transaction', 'ccy']).sum().unstack(2)

                             amt          
ccy                          EUR       USD
date       transaction                    
2018-12-28 aa                NaN  0.404488
           bb           0.459295       NaN
           cc                NaN       NaN
2018-12-29 aa                NaN  0.439354
           bb                NaN       NaN
           cc           0.429269       NaN
2018-12-30 aa                NaN       NaN
           bb                NaN  1.542451
           cc                NaN       NaN

Missing categories in the output are represented by NaNs. This is usually possible when performing numeric aggregation.

If you don't want to modify df, this will do:

u = pd.Series(pd.Categorical(df.transaction), name='transaction')
df.groupby(['date', u, 'ccy']).sum().unstack(2)

                             amt          
ccy                          EUR       USD
date       transaction                    
2018-12-28 aa                NaN  0.429134
           bb           0.852355       NaN
           cc                NaN       NaN
2018-12-29 aa                NaN  0.541576
           bb                NaN       NaN
           cc           0.994095       NaN
2018-12-30 aa                NaN       NaN
           bb                NaN  0.744587
           cc                NaN       NaN

来源：https://stackoverflow.com/questions/54033021/include-missing-group-keys-as-nan-in-pandas-groupby-output

标签

python

pandas

group-by

pandas-groupby