Question
I have a dataframe in pandas.
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'date': ['2018-12-28', '2018-12-28', '2018-12-29', '2018-12-29', '2018-12-30', '2018-12-30'],
                        'transaction': ['aa', 'bb', 'cc', 'aa', 'bb', 'bb'],
                        'ccy': ['USD', 'EUR', 'EUR', 'USD', 'USD', 'USD'],
                        'amt': np.random.random(6)})
test_df:
date        transaction  ccy  amt
2018-12-28  aa           USD  0.323439
2018-12-28  bb           EUR  0.048948
2018-12-29  cc           EUR  0.793263
2018-12-29  aa           USD  0.013865
2018-12-30  bb           USD  0.658571
2018-12-30  bb           USD  0.224951
The following code is giving me this output.
grouper = test_df.groupby([pd.Grouper('date'), 'transaction', 'ccy'])
grp_transactions = grouper['amt'].sum().unstack()
output:
ccy                          EUR       USD
date       transaction
2018-12-28 aa                NaN  0.323439
           bb           0.048948       NaN
2018-12-29 aa                NaN  0.013865
           cc           0.793263       NaN
2018-12-30 bb                NaN  0.883523
I believe this is expected as the groupby function will group values in the columns based on the order above, sum accordingly, and not create new rows for transactions that are not in the DF.
Is there a way in pandas to include NaN rows for transactions that did not occur on a particular day when using groupby? That is, the output should be NaN for both currencies if my DataFrame has no "cc" transaction on 28/12/2018.
Expected output:
ccy                          EUR       USD
date       transaction
2018-12-28 aa                NaN  0.323439
           bb           0.048948       NaN
           cc                NaN       NaN
2018-12-29 aa                NaN  0.013865
           bb                NaN       NaN
           cc           0.793263       NaN
2018-12-30 aa                NaN       NaN
           bb                NaN  0.883523
           cc                NaN       NaN
Any help would be appreciated. Thanks!
Answer 1:
This is easy if you convert "transaction" to a categorical column before grouping (df here is the question's test_df):
df.transaction = pd.Categorical(df.transaction)
df.groupby(['date', 'transaction', 'ccy']).sum().unstack(2)
                             amt
ccy                          EUR       USD
date       transaction
2018-12-28 aa                NaN  0.404488
           bb           0.459295       NaN
           cc                NaN       NaN
2018-12-29 aa                NaN  0.439354
           bb                NaN       NaN
           cc           0.429269       NaN
2018-12-30 aa                NaN       NaN
           bb                NaN  1.542451
           cc                NaN       NaN
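Categories that never occur for a given date still get a row in the output, represented by NaNs. Grouping on a categorical column can produce a group for every category rather than only the observed ones, and this usually works when performing a numeric aggregation such as sum.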
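For reference, here is a self-contained sketch of the categorical approach, using fixed amounts instead of random ones so the result is reproducible. It passes observed=False explicitly, because the default for that argument has been changing in recent pandas releases. Depending on the version, the cells in the added rows may show NaN or 0 for sum, so the sketch does not rely on either.

```python
import pandas as pd

# Same layout as the question's data, with fixed amounts (an assumption
# made here for reproducibility; the question uses random values).
df = pd.DataFrame({
    'date': ['2018-12-28', '2018-12-28', '2018-12-29',
             '2018-12-29', '2018-12-30', '2018-12-30'],
    'transaction': ['aa', 'bb', 'cc', 'aa', 'bb', 'bb'],
    'ccy': ['USD', 'EUR', 'EUR', 'USD', 'USD', 'USD'],
    'amt': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Make "transaction" categorical so every category is known to groupby.
df['transaction'] = pd.Categorical(df['transaction'])

# observed=False asks groupby to keep categories that do not occur
# within a group instead of dropping them.
result = (
    df.groupby(['date', 'transaction', 'ccy'], observed=False)['amt']
    .sum()
    .unstack('ccy')
)
print(result)
```

Selecting the 'amt' column before aggregating drops the extra "amt" column level seen in the output above, and unstack('ccy') is the by-name equivalent of unstack(2).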
If you don't want to modify df, this will do:
u = pd.Series(pd.Categorical(df.transaction), name='transaction')
df.groupby(['date', u, 'ccy']).sum().unstack(2)
                             amt
ccy                          EUR       USD
date       transaction
2018-12-28 aa                NaN  0.429134
           bb           0.852355       NaN
           cc                NaN       NaN
2018-12-29 aa                NaN  0.541576
           bb                NaN       NaN
           cc           0.994095       NaN
2018-12-30 aa                NaN       NaN
           bb                NaN  0.744587
           cc                NaN       NaN
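If the added rows must be NaN no matter how a given pandas version fills unobserved categories, a different route (not the categorical technique above) is to aggregate as in the question and then reindex against every date/transaction pair. A minimal sketch, with full_index as an illustrative name and fixed amounts assumed for reproducibility:

```python
import pandas as pd

# Same layout as the question's data, with fixed amounts (an assumption
# made here for reproducibility; the question uses random values).
test_df = pd.DataFrame({
    'date': ['2018-12-28', '2018-12-28', '2018-12-29',
             '2018-12-29', '2018-12-30', '2018-12-30'],
    'transaction': ['aa', 'bb', 'cc', 'aa', 'bb', 'bb'],
    'ccy': ['USD', 'EUR', 'EUR', 'USD', 'USD', 'USD'],
    'amt': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Plain groupby: only the (date, transaction) pairs present in the data appear.
grp = test_df.groupby(['date', 'transaction', 'ccy'])['amt'].sum().unstack('ccy')

# Build every date/transaction combination, then reindex onto it.
# Pairs missing from grp become rows of NaN.
full_index = pd.MultiIndex.from_product(
    [sorted(test_df['date'].unique()), sorted(test_df['transaction'].unique())],
    names=['date', 'transaction'],
)
out = grp.reindex(full_index)
print(out)
```

This leaves the original column dtypes untouched, at the cost of building the full index by hand.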
Source: https://stackoverflow.com/questions/54033021/include-missing-group-keys-as-nan-in-pandas-groupby-output