Question
Related to "Dataframe aggregate method passing list problem" and "Pandas fails to aggregate with a list of aggregation functions".
Consider this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]
According to the documentation for aggregate, you should be able to specify which columns to aggregate using a dict like this:
df.agg({'a' : 'mean'})
Which returns
a 13.5
But if you try to aggregate with a user-defined function like this one:
def nok_mean(x):
    return np.mean(x)
df.agg({'a' : nok_mean})
it returns one value per row instead of the mean of the column:
a
0 0.0
1 3.0
2 6.0
3 9.0
4 12.0
5 15.0
6 18.0
7 21.0
8 24.0
9 27.0
Why does the user-defined function not return the same as aggregating with np.mean or 'mean'?
This is using pandas version 0.23.4, numpy version 1.15.4, python version 3.7.1
Answer 1:
The issue has to do with applying np.mean to a series. Let's look at a few examples:
def nok_mean(x):
    return x.mean()
df.agg({'a': nok_mean})
a 13.5
dtype: float64
This works as expected because you are using the pandas version of mean, which can be applied to a series or a dataframe:
df['a'].agg(nok_mean)
df.apply(nok_mean)
Let's see what happens when np.mean is applied to a series:
def nok_mean1(x):
    return np.mean(x)
df['a'].agg(nok_mean1)
df.agg({'a':nok_mean1})
df['a'].apply(nok_mean1)
df['a'].apply(np.mean)
These all return:
0 0.0
1 3.0
2 6.0
3 9.0
4 12.0
5 15.0
6 18.0
7 21.0
8 24.0
9 27.0
Name: a, dtype: float64
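A minimal sketch of why an elementwise call reproduces the column: np.mean of a single scalar is just that scalar as a float. The explicit Series.map call below is only a stand-in for "apply the function to each element"; it is an illustration, not a claim about what agg does internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]

# np.mean of one scalar is that scalar, converted to float
scalar_mean = np.mean(6)

# Applying np.mean to each element therefore echoes the column as floats
elementwise = df['a'].map(np.mean)

# Handing the whole Series to np.mean gives the single column mean
column_mean = np.mean(df['a'])

print(scalar_mean)
print(elementwise.tolist())
print(column_mean)
```

So whether you see the column mean or a copy of the column depends on whether the function receives the whole Series or one element at a time.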
When you apply np.mean to a dataframe, it works as expected:
df.agg(nok_mean1)
df.apply(nok_mean1)
a 13.5
b -8.0
dtype: float64
To get np.mean to work as expected inside a user-defined function, pass it an ndarray (x.values) rather than the series itself:
def nok_mean2(x):
    return np.mean(x.values)
df.agg({'a':nok_mean2})
a 13.5
dtype: float64
I am guessing all of this has to do with apply, which works element by element on a series. That is why df['a'].apply(nok_mean2) raises an AttributeError: each element is a scalar with no .values attribute.
I am guessing here in the source code
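The AttributeError mentioned above can be reproduced with a short sketch. It assumes the same df and nok_mean2 as in this answer; the names direct and apply_error are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]

def nok_mean2(x):
    return np.mean(x.values)

# Called directly on the column, the function sees a Series and reduces it
direct = nok_mean2(df['a'])

# Series.apply hands over one element at a time, and a scalar has no .values
try:
    df['a'].apply(nok_mean2)
    apply_error = None
except AttributeError as exc:
    apply_error = exc

print(direct)
print(type(apply_error).__name__)
```

A function that only works on a whole Series cannot silently succeed elementwise, which is probably why the x.values and x.mean() variants end up producing a single aggregated value.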
Answer 2:
When you define your nok_mean function this way, you are effectively asking for np.mean of each row's value.
It computes the mean for each row and returns the result. For example, if your dataframe looked like this:
a b
0 [0, 0] 1
1 [3, 4] -1
2 [6, 8] -3
3 [9, 12] -5
4 [12, 16] -7
5 [15, 20] -9
6 [18, 24] -11
7 [21, 28] -13
8 [24, 32] -15
9 [27, 36] -17
Then df.agg({'a': nok_mean}) would return this:
a
0 0.0
1 3.5
2 7.0
3 10.5
4 14.0
5 17.5
6 21.0
7 24.5
8 28.0
9 31.5
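The per-row numbers above can be checked with a small sketch. It builds the list-valued column from this example and applies np.mean to each element explicitly with Series.map, as a stand-in for the elementwise behavior described here. Passing a callable to agg may behave differently on other pandas releases, so this does not call agg itself.

```python
import numpy as np
import pandas as pd

# Column 'a' holds a two-element list in every row, as in the table above
df = pd.DataFrame({
    'a': [[3 * x, 4 * x] for x in range(10)],
    'b': [1 - 2 * x for x in range(10)],
})

# np.mean is evaluated once per row, on that row's list
row_means = df['a'].map(np.mean)

print(row_means.tolist())
```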
Answer 3:
This is related to how calculations are made on the pandas side.
When you pass a dict of functions, the input is treated as a DataFrame instead of a flattened array. After that all calculations are made over the index axis by default. That's why you're getting the means by row.
If you go to the docs page you'll see:
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
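The difference the docs describe can be seen on the dataframe from the question. This is a minimal sketch; arr_2d, flat_mean, per_column and pandas_mean are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]

arr_2d = df.values

# numpy default: one mean over the flattened array (all 20 values)
flat_mean = np.mean(arr_2d)

# numpy with an explicit axis: one mean per column
per_column = np.mean(arr_2d, axis=0)

# pandas default: aggregate over the index, so one mean per column
pandas_mean = df.mean()

print(flat_mean)
print(per_column)
print(pandas_mean)
```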
I think the only way to emulate numpy's behavior and pass a dict of functions to agg at the same time is df.agg(nok_mean)['a'].
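A sketch of that workaround, assuming the df and nok_mean from the question: the function is applied to every column, and the column of interest is selected afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]

def nok_mean(x):
    return np.mean(x)

# Aggregate every column with the user-defined function...
all_columns = df.agg(nok_mean)

# ...then pick out the one that is needed
a_mean = all_columns['a']

print(all_columns)
print(a_mean)
```

The cost is that every column is aggregated even though only one result is kept, which matters little for a small frame like this one.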
Source: https://stackoverflow.com/questions/54892806/pandas-dataframe-aggregate-behaving-oddly