Question
Related to "Dataframe aggregate method passing list problem" and "Pandas fails to aggregate with a list of aggregation functions".
Consider this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]
According to the documentation for aggregate, you should be able to specify which columns to aggregate using a dict like this:
df.agg({'a' : 'mean'})
Which returns
a 13.5
But if you try to aggregate with a user-defined function like this one
def nok_mean(x):
    return np.mean(x)
df.agg({'a' : nok_mean})
It returns the mean for each row rather than the column
a
0 0.0
1 3.0
2 6.0
3 9.0
4 12.0
5 15.0
6 18.0
7 21.0
8 24.0
9 27.0
Why does the user-defined function not return the same as aggregating with np.mean or 'mean'?
This is using pandas version 0.23.4, numpy version 1.15.4, python version 3.7.1.
Answer 1:
The issue has to do with applying np.mean to a series. Let's look at a few examples:
def nok_mean(x):
    return x.mean()
df.agg({'a': nok_mean})
a 13.5
dtype: float64
This works as expected because you are using the pandas version of mean, which can be applied to a series or a dataframe:
df['a'].agg(nok_mean)
df.apply(nok_mean)
Let's see what happens when np.mean is applied to a series:
def nok_mean1(x):
    return np.mean(x)
df['a'].agg(nok_mean1)
df.agg({'a':nok_mean1})
df['a'].apply(nok_mean1)
df['a'].apply(np.mean)
All of these return:
0 0.0
1 3.0
2 6.0
3 9.0
4 12.0
5 15.0
6 18.0
7 21.0
8 24.0
9 27.0
Name: a, dtype: float64
When you apply np.mean to a dataframe, it works as expected:
df.agg(nok_mean1)
df.apply(nok_mean1)
a 13.5
b -8.0
dtype: float64
In order to get np.mean to work as expected inside a user-defined function, pass it an ndarray instead of x itself:
def nok_mean2(x):
    return np.mean(x.values)
df.agg({'a':nok_mean2})
a 13.5
dtype: float64
I am guessing all of this has to do with apply, which calls the function once per element. A scalar element has no .values attribute, which is why df['a'].apply(nok_mean2) returns an attribute error.
I am guessing the cause is here in the source code.
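The guess above can be made concrete with a small sketch. As far as I can tell, Series.aggregate in pandas 0.23 first tries Series.apply(func), which calls func on each element, and only falls back to func(series) if that raises ValueError, AttributeError or TypeError. The helper agg_with_fallback below is hypothetical and written only to imitate that control flow; it is not a pandas API, and newer pandas versions may behave differently.

```python
import numpy as np
import pandas as pd


def agg_with_fallback(series, func):
    # Hypothetical helper: imitates the try/except assumed to exist in
    # Series.aggregate (pandas 0.23). It is not part of pandas.
    try:
        # First attempt: call func on every element of the Series.
        return series.apply(func)
    except (ValueError, AttributeError, TypeError):
        # Fallback: call func once on the whole Series.
        return func(series)


def numpy_mean(x):
    # np.mean also accepts a scalar, so the first attempt succeeds.
    return np.mean(x)


def pandas_mean(x):
    # apply hands over plain Python ints here, which have no .mean(),
    # so the AttributeError triggers the fallback.
    return x.mean()


def values_mean(x):
    # Plain Python ints have no .values either, so the fallback runs.
    return np.mean(x.values)


s = pd.Series([3 * x for x in range(10)], name='a')

elementwise = agg_with_fallback(s, numpy_mean)   # one value per row
reduced = agg_with_fallback(s, pandas_mean)      # a single number
reduced_np = agg_with_fallback(s, values_mean)   # a single number

print(elementwise.tolist())
print(reduced, reduced_np)
```

Under that assumption, the function only aggregates when it fails on a single element, which matches the outputs shown for nok_mean, nok_mean1 and nok_mean2.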
Answer 2:
When you define your nok_mean function, your function definition is basically saying that you want np.mean for each row. It finds the mean for each row and returns the result. For example, if your dataframe looked like this:
a b
0 [0, 0] 1
1 [3, 4] -1
2 [6, 8] -3
3 [9, 12] -5
4 [12, 16] -7
5 [15, 20] -9
6 [18, 24] -11
7 [21, 28] -13
8 [24, 32] -15
9 [27, 36] -17
Then df.agg({'a': nok_mean}) would return this:
a
0 0.0
1 3.5
2 7.0
3 10.5
4 14.0
5 17.5
6 21.0
7 24.5
8 28.0
9 31.5
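To see where those numbers come from, here is a minimal sketch that rebuilds a column of two-element lists and averages each cell on its own. The column values are reconstructed from the table above, and Series.apply with a small wrapper (cell_mean, a name chosen for this example) is used instead of agg so that the element-by-element call is explicit.

```python
import numpy as np
import pandas as pd

# Each cell of column 'a' holds a two-element list, as in the table above.
df2 = pd.DataFrame({
    'a': [[3 * x, 4 * x] for x in range(10)],
    'b': [1 - 2 * x for x in range(10)],
})


def cell_mean(cell):
    # Called once per cell, so each list is averaged separately.
    return np.mean(cell)


per_row = df2['a'].apply(cell_mean)
print(per_row.tolist())
```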
Answer 3:
This is related to how calculations are made on the pandas side.
When you pass a dict of functions, the input is treated as a DataFrame instead of a flattened array. After that all calculations are made over the index axis by default. That's why you're getting the means by row.
If you go to the docs page you'll see:
"The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0)."
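The difference described in that passage is easy to check on a small array. This is only an illustration with made-up values; arr_2d and the column names are not taken from the question.

```python
import numpy as np
import pandas as pd

arr_2d = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

flat_mean = np.mean(arr_2d)             # mean of all four values
column_means = np.mean(arr_2d, axis=0)  # one mean per column

# pandas reduces along the index by default, giving one value per column.
frame_means = pd.DataFrame(arr_2d, columns=['a', 'b']).mean()

print(flat_mean)
print(column_means)
print(frame_means.tolist())
```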
I think the only way to emulate numpy's behavior and pass a dict of functions to agg at the same time is df.agg(nok_mean)['a'].
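A minimal sketch of that workaround, using the dataframe from the question. It relies on DataFrame.agg handing each column to the function as a whole Series when no dict is given, which is what the outputs in the first answer show; it may be worth double-checking on your pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]


def nok_mean(x):
    return np.mean(x)


# Aggregate every column first, then pick the one of interest.
all_means = df.agg(nok_mean)
mean_a = all_means['a']
print(mean_a)
```

The cost is that the function runs on every column, including ones you do not need.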
Source: https://stackoverflow.com/questions/54892806/pandas-dataframe-aggregate-behaving-oddly