Pandas groupby count non-null values as percentage

Anonymous (unverified), submitted 2019-12-03 01:34:02

Question:

Given this dataset, I would like to count the missing (NaN) values:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, np.nan, 2, 55, 6, np.nan, -17, np.nan],
                       'Team': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                       'C': [4, 14, 3, 8, 8, 7, np.nan, 11],
                       'D': [np.nan, np.nan, -12, 12, 12, -12, np.nan, np.nan]})

Specifically, I want the count as a percentage per group in the 'Team' column. I can get the raw count like this:

    df.groupby('Team').count()

This returns the number of non-missing values in each column for each group. Instead of the raw count, I would like a percentage of the total entries in each group (the groups are of unequal size, and I don't know the sizes in advance). I've tried using .agg(), but I can't seem to get what I want. How can I do this?

Answer 1:

You can take the mean of the notnull Boolean DataFrame:

    In [11]: df.notnull()
    Out[11]:
           A      C      D  Team
    0   True   True  False  True
    1  False   True  False  True
    2   True   True   True  True
    3   True   True   True  True
    4   True   True   True  True
    5  False   True   True  True
    6   True  False  False  True
    7  False   True  False  True

    In [12]: df.notnull().mean()
    Out[12]:
    A       0.625
    C       0.875
    D       0.500
    Team    1.000
    dtype: float64
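Taking the mean works because pandas treats True as 1 and False as 0 when averaging a Boolean column. A minimal sketch on a made-up Series (the name `flags` and its values are purely illustrative, not taken from the question):

```python
import pandas as pd

# Hypothetical Boolean Series: three True values out of four entries.
flags = pd.Series([True, False, True, True])

# True counts as 1 and False as 0, so the mean is the share of True values.
share_true = flags.mean()

# The same number, computed explicitly as "count of True / total entries".
explicit = flags.sum() / len(flags)

print(share_true, explicit)
```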

and with the groupby:

    In [13]: df.groupby("Team").apply(lambda x: x.notnull().mean())
    Out[13]:
                  A         C    D  Team
    Team
    one    0.666667  0.666667  0.0   1.0
    three  0.500000  1.000000  0.5   1.0
    two    0.666667  1.000000  1.0   1.0

It may be faster to do this without an apply, by using set_index first:

    In [14]: df.set_index("Team").notnull().groupby(level=0).mean()
    Out[14]:
                  A         C    D
    Team
    one    0.666667  0.666667  0.0
    three  0.500000  1.000000  0.5
    two    0.666667  1.000000  1.0
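Since the question asks for percentages and mentions missing values, here is a self-contained sketch that scales the set_index result to percent and also derives the missing share as its complement. The variable names and the rounding to one decimal are my own choices, not part of the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, 2, 55, 6, np.nan, -17, np.nan],
    "Team": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [4, 14, 3, 8, 8, 7, np.nan, 11],
    "D": [np.nan, np.nan, -12, 12, 12, -12, np.nan, np.nan],
})

# Fraction of non-null entries per team (the set_index approach).
non_null_frac = df.set_index("Team").notnull().groupby(level=0).mean()

# Scale to percentages; rounding is only a presentation choice.
non_null_pct = (non_null_frac * 100).round(1)

# The share of missing values is the complement of the non-null share.
missing_pct = (100 - non_null_frac * 100).round(1)

print(non_null_pct)
print(missing_pct)
```

If only the missing share is needed, swapping notnull() for isnull() in the first expression should give the same fractions directly.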


Answer 2:

Building on your own code, divide the counts by the group sizes by adding .div(df.groupby('Team').size(), 0):

    df.groupby('Team').count().div(df.groupby('Team').size(),0)
    Out[190]:
                  A         C    D
    Team
    one    0.666667  0.666667  0.0
    three  0.500000  1.000000  0.5
    two    0.666667  1.000000  1.0
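The division works because count() skips NaN while size() counts every row in the group, so their ratio is the non-null fraction. The sketch below spells out the two pieces and checks that the result matches the mean-of-notnull approach; the intermediate names (`counts`, `sizes`, `ratio`, `via_mean`) are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, 2, 55, 6, np.nan, -17, np.nan],
    "Team": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [4, 14, 3, 8, 8, 7, np.nan, 11],
    "D": [np.nan, np.nan, -12, 12, 12, -12, np.nan, np.nan],
})

grouped = df.groupby("Team")

# count() ignores NaN, so it gives the non-null entries per column and team.
counts = grouped.count()

# size() counts every row in each team, whether or not it holds NaN.
sizes = grouped.size()

# axis=0 aligns the sizes with the row index (the team names).
ratio = counts.div(sizes, axis=0)

# Cross-check against the mean of the Boolean notnull frame.
via_mean = df.set_index("Team").notnull().groupby(level=0).mean()
same = np.allclose(ratio.to_numpy(), via_mean[ratio.columns].to_numpy())

print(sizes.to_dict())
print(same)
```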

