I am learning the Pandas package by replicating the output from some of the R vignettes. Right now I am using the dplyr package from R as an example:
http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
R script
planes <- group_by(hflights_df, TailNum)
delay <- summarise(planes,
  count = n(),
  dist = mean(Distance, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
Python script
planes = hflights.groupby('TailNum')
# Named aggregation; current pandas rejects the dict-renaming form on a single column
delay = planes['Distance'].agg(count='count', dist='mean')
How can I state explicitly in Python that NA values need to be skipped?
That's a trick question, since you don't do that. Pandas automatically excludes NaN values from aggregation functions. Consider my df:
     b    c    d  e
a
2    2    6    1  3
2    4    8  NaN  7
2    4    4    6  3
3    5  NaN    2  6
4  NaN  NaN    4  1
5    6    2    1  8
7    3    2    4  7
9    6    1  NaN  1
9  NaN  NaN    9  3
9    3    4    6  1
The internal count() function ignores NaN values, and so does mean(). The only case where the result is NaN is when every value in the group is NaN. Then we are taking the mean of an empty set, which turns out to be NaN:
In[335]: df.groupby('a').mean()
Out[333]:
          b    c    d         e
a
2  3.333333  6.0  3.5  4.333333
3  5.000000  NaN  2.0  6.000000
4       NaN  NaN  4.0  1.000000
5  6.000000  2.0  1.0  8.000000
7  3.000000  2.0  4.0  7.000000
9  4.500000  2.5  7.5  1.666667
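To check the counts as well as the means, here is a minimal sketch that rebuilds the frame above. It assumes a is an ordinary column rather than the index (groupby('a') accepts either), and the variable names are only for illustration:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Same values as the df shown above, with 'a' kept as a regular column.
df = pd.DataFrame(
    {
        "a": [2, 2, 2, 3, 4, 5, 7, 9, 9, 9],
        "b": [2, 4, 4, 5, nan, 6, 3, 6, nan, 3],
        "c": [6, 8, 4, nan, nan, 2, 2, 1, nan, 4],
        "d": [1, nan, 6, 2, 4, 1, 4, nan, 9, 6],
        "e": [3, 7, 3, 6, 1, 8, 7, 1, 3, 1],
    }
)

grouped = df.groupby("a")
counts = grouped.count()  # number of non-NaN values per group and column
means = grouped.mean()    # NaN values are left out of each group mean

print(counts)
print(means)
```

Group 4 has a count of 0 in column b, which is why its mean there comes out as NaN.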
Aggregate functions work in the same way:
In[340]: df.groupby('a')['b'].agg(foo='mean')
Out[338]:
        foo
a
2  3.333333
3  5.000000
4       NaN
5  6.000000
7  3.000000
9  4.500000
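Applied to the question, the whole dplyr pipeline could look roughly like the sketch below. The hflights frame here is a tiny made-up stand-in with only the two columns involved, and the count threshold is lowered from 20 to 1 so the toy data survives the filter. One detail to be aware of: dplyr's n() counts rows, so "size" is the closer match, whereas "count" would leave out rows where Distance is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the hflights data; values are invented.
hflights = pd.DataFrame(
    {
        "TailNum": ["N1", "N1", "N1", "N2", "N2", "N3"],
        "Distance": [100.0, np.nan, 300.0, 2500.0, 2700.0, 400.0],
    }
)

delay = hflights.groupby("TailNum").agg(
    count=("Distance", "size"),  # rows per group, like n(); "count" would skip NaN
    dist=("Distance", "mean"),   # NaN is skipped, like mean(Distance, na.rm = TRUE)
)

# Toy threshold of 1 instead of the vignette's 20, because the sample is tiny.
delay = delay[(delay["count"] > 1) & (delay["dist"] < 2000)]
print(delay)
```

Only N1 remains: it has three rows and a mean distance of 200 computed from its two non-NaN values.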
Addendum: Notice that the standard DataFrame.mean API lets you control the inclusion of NaN values through its skipna parameter; the default is to exclude them.
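A small sketch of that difference on a plain DataFrame, using a made-up two-column frame:

```python
import numpy as np
import pandas as pd

# Illustrative data: column 'b' contains one NaN, column 'e' has none.
frame = pd.DataFrame({"b": [6.0, np.nan, 3.0], "e": [1.0, 3.0, 1.0]})

default_mean = frame.mean()             # skipna=True by default: NaN is ignored
strict_mean = frame.mean(skipna=False)  # any NaN in a column makes its mean NaN

print(default_mean)
print(strict_mean)
```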
What foobar said is true with regard to the default behavior, but there is a very easy way to specify skipna. Here is an example that speaks for itself:
def custom_mean(df):
    return df.mean(skipna=False)

group.agg({"your_col_name_to_be_aggregated": custom_mean})
That's it! You can customize your own aggregation the way you want, and I'd expect this to be fairly efficient, but I did not dig into it.
It was also discussed here, but I thought I'd help spread the good news! Answer was found in the official doc.
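For a version you can run end to end, here is a minimal sketch on a made-up frame. The column names and data are placeholders, and the built-in "mean" is included only for comparison:

```python
import numpy as np
import pandas as pd

# Illustrative data: group 9 contains one NaN in column 'b'.
df = pd.DataFrame({"a": [2, 2, 9, 9, 9], "b": [2.0, 4.0, 6.0, np.nan, 3.0]})


def custom_mean(series):
    # Each group's column arrives as a Series; keep NaN in the calculation.
    return series.mean(skipna=False)


strict = df.groupby("a").agg({"b": custom_mean})  # NaN propagates
lenient = df.groupby("a").agg({"b": "mean"})      # default: NaN skipped

print(strict)
print(lenient)
```

Group 9 comes out as NaN with custom_mean and as 4.5 with the default mean.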
Source: https://stackoverflow.com/questions/25039328/specifying-skip-na-when-calculating-mean-of-the-column-in-a-data-frame-created