Use pandas to group by column and then create a new column based on a condition

╄→гoц情女王★ 提交于 2019-12-11 11:54:35

问题


I need to reproduce with pandas what SQL does so easily:

select
    del_month
    , sum(case when off0_on1 = 1 then 1 else 0 end) as on1
    , sum(case when off0_on1 = 0 then 1 else 0 end) as off0
from a1
group by del_month
order by del_month

Here is a sample, illustrative pandas dataframe to work on:

a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 'off0_on1':[0,0,1,1,0,1,1,1]})

Here are my attempts to reproduce the above SQL with pandas. The first line works. The second line gives an error:

a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(sum)
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(sum(lambda x: 1 if x == 0 else 0))

Here's the second line's error:

TypeError: 'function' object is not iterable

This previous question of mine had a problem with the lambda function, which was solved. The bigger problem is how to reproduce SQL's "sum(case when)" logic on grouped data. I'm looking for a general solution, since I need to do this sort of thing often. The answers in my previous question suggested using map() inside the lambda function, but the following results for the "off0" column are not what I need. The "on1" column is what I want. The answer should be the same for the whole group (i.e. "del_month").


回答1:


Simply sum the Trues in your conditional logic expressions:

import pandas as pd

a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 
                   'off0_on1':[0,0,1,1,0,1,1,1]})

a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==1))    
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==0))

print(a1)    
#    del_month  off0_on1  on1  off0
# 0          1         0    2     2
# 1          1         0    2     2
# 2          1         1    2     2
# 3          1         1    2     2
# 4          2         0    3     1
# 5          2         1    3     1
# 6          2         1    3     1
# 7          2         1    3     1

Similarly, you can do the same in SQL if dialect supports it which most should:

select
    del_month
    , sum(off0_on1 = 1) as on1
    , sum(off0_on1 = 0) as off0
from a1
group by del_month
order by del_month

And to replicate above SQL in pandas, don't use transform but send multiple aggregates in a groupby().apply() call:

def aggfunc(x):
    data = {'on1': sum(x['off0_on1'] == 1),
            'off0': sum(x['off0_on1'] == 0)}

    return pd.Series(data)

g = a1.groupby('del_month').apply(aggfunc)

print(g)    
#            on1  off0
# del_month           
# 1            2     2
# 2            3     1



回答2:


Using get_dummies would only need a single groupby call, which is simpler.

v = pd.get_dummies(df.pop('off0_on1')).groupby(df.del_month).transform(sum)
df = pd.concat([df, v.rename({0: 'off0', 1: 'on1'}, axis=1)], axis=1)

df
   del_month  off0  on1
0          1     2    2
1          1     2    2
2          1     2    2
3          1     2    2
4          2     1    3
5          2     1    3
6          2     1    3
7          2     1    3

Additionally, for the case of aggregation, call sum directly instead of using apply:

(pd.get_dummies(df.pop('off0_on1'))
   .groupby(df.del_month)
   .sum()
   .rename({0: 'off0', 1: 'on1'}, axis=1))

           off0  on1
del_month           
1             2    2
2             1    3


来源:https://stackoverflow.com/questions/53107112/use-pandas-to-group-by-column-and-then-create-a-new-column-based-on-a-condition

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!