Given this dataset, I would like to count missing (NaN) values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 2, 55, 6, np.nan, -17, np.nan],
                   'Team': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': [4, 14, 3, 8, 8, 7, np.nan, 11],
                   'D': [np.nan, np.nan, -12, 12, 12, -12, np.nan, np.nan]})
Specifically, I want the count (as a percentage) per group in the 'Team' column. I can get the raw count with this:
df.groupby('Team').count()
This gives the number of non-missing values. What I would like instead is a percentage: rather than the raw count, the count as a percentage of the total number of entries in each group (the groups are all of uneven size, and I don't know the sizes in advance). I've tried using .agg(), but I can't seem to get what I want. How can I do this?
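For reference, the raw count from the groupby above looks like this on the sample data:

       A  C  D
Team
one    2  2  0
three  1  2  1
two    2  3  3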
You can take the mean of the notnull Boolean DataFrame:
In [11]: df.notnull()
Out[11]:
       A      C      D  Team
0   True   True  False  True
1  False   True  False  True
2   True   True   True  True
3   True   True   True  True
4   True   True   True  True
5  False   True   True  True
6   True  False  False  True
7  False   True  False  True

In [12]: df.notnull().mean()
Out[12]:
A       0.625
C       0.875
D       0.500
Team    1.000
dtype: float64
and with the groupby:
In [13]: df.groupby("Team").apply(lambda x: x.notnull().mean())
Out[13]:
              A         C    D  Team
Team
one    0.666667  0.666667  0.0   1.0
three  0.500000  1.000000  0.5   1.0
two    0.666667  1.000000  1.0   1.0
It may be faster to do this without an apply, using set_index first:
In [14]: df.set_index("Team").notnull().groupby(level=0).mean()
Out[14]:
              A         C    D
Team
one    0.666667  0.666667  0.0
three  0.500000  1.000000  0.5
two    0.666667  1.000000  1.0
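This gives the fraction of non-missing values per group. If you want the share of NaN values instead, the same pattern works with isnull(); a minimal sketch of that variant:

df.set_index("Team").isnull().groupby(level=0).mean()  # fraction of missing values per group

which is simply one minus the table above.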
Based on your own code, add .div(df.groupby('Team').size(), 0):
df.groupby('Team').count().div(df.groupby('Team').size(), 0)
Out[190]:
              A         C    D
Team
one    0.666667  0.666667  0.0
three  0.500000  1.000000  0.5
two    0.666667  1.000000  1.0
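If you want an actual percentage (0 to 100) rather than a fraction, scale the result by 100, for example:

df.groupby('Team').count().div(df.groupby('Team').size(), 0).mul(100)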