Question
How do I specify custom aggregating functions so that they behave correctly when used in list arguments of pandas.DataFrame.aggregate?
Given a two-column dataframe in pandas ...
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['a'] = [ 3 * x for x in range(10) ]
df['b'] = [ 1 -2 * x for x in range(10) ]
... aggregating over a list of aggregation function specs is not a problem:
def ok_mean(x):
    return x.mean()
df.aggregate(['mean', np.max, ok_mean])
            a    b
mean     13.5 -8.0
amax     27.0  1.0
ok_mean  13.5 -8.0
but when the aggregation is specified as a (lambda or named) function that calls np.mean, it fails to aggregate:
def nok_mean(x):
    return np.mean(x)
df.aggregate([lambda x: np.mean(x), nok_mean])
         a                 b
  <lambda> nok_mean <lambda> nok_mean
0      0.0      0.0      1.0      1.0
1      3.0      3.0     -1.0     -1.0
2      6.0      6.0     -3.0     -3.0
3      9.0      9.0     -5.0     -5.0
4     12.0     12.0     -7.0     -7.0
...
Mixing aggregating and non-aggregating specs leads to errors:
df.aggregate(['mean', nok_mean])
~/anaconda3/envs/tsa37_jup/lib/python3.7/site-packages/pandas/core/base.py in _aggregate_multiple_funcs(self, arg, _level, _axis)
607 # if we are empty
608 if not len(results):
--> 609 raise ValueError("no results")
610
While using the aggregating function directly (not in list) gives the expected result:
df.aggregate(nok_mean)
a 13.5
b -8.0
dtype: float64
Is this a bug, or am I missing something in the way I define aggregation functions? In my real project, I'm using more complex aggregation functions (such as this percentile one). So my question is:
How do I specify a custom aggregating function in order to work around this bug?
Note that using the custom aggregating function over a rolling, expanding or group-by window gives the expected result:
df.expanding().aggregate(['mean', nok_mean])
## returns cumulative aggregation results as expected
Pandas version: 0.23.4
Answer 1:
I found that making the aggregating function fail when called with a non-Series argument is a workaround:
def ok_mean(x):
    return np.mean(x.values)

def ok_mean2(x):
    if not isinstance(x, pd.Series):
        raise ValueError('need Series argument')
    return np.mean(x)
df.aggregate(['mean', ok_mean, ok_mean2])
It seems that in this circumstance (a list argument to pandas.DataFrame.aggregate), pandas first tries to apply the aggregating function to each data point and, once that fails, falls back to the correct behaviour (calling the function with the Series to be aggregated).
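One way to check this on your own installation is to record what the function is actually called with. The sketch below is a diagnostic added for illustration, not part of the original answer; `probing_mean` and `seen_types` are hypothetical names, and the calling sequence you observe may well differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]

seen_types = []  # type name of every argument the function receives


def probing_mean(x):
    # Record the argument type, then refuse anything that is not a Series.
    seen_types.append(type(x).__name__)
    if not isinstance(x, pd.Series):
        raise ValueError('need Series argument')
    return x.mean()


result = df.aggregate([probing_mean])
print(sorted(set(seen_types)))  # which kinds of argument pandas passed in
print(result)
```

If scalar type names show up next to `Series` in the printed list, pandas tried the element-wise call first on that version.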
Using a decorator to force Series arguments:
def assert_argtype(clazz):
    def wrapping(f):
        def wrapper(s):
            if not isinstance(s, clazz):
                raise ValueError('needs %s argument' % clazz)
            return f(s)
        return wrapper
    return wrapping

@assert_argtype(pd.Series)
def nok_mean(x):
    return np.mean(x)
df.aggregate([nok_mean])
## OK now, decorator fixed it!
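One detail worth noting: the function returned by this decorator is named `wrapper`, and pandas labels the result rows by function name (as the `ok_mean` row in the question shows), so the output row would presumably be labelled `wrapper` as well. A minimal variant, assuming that matters to you, uses `functools.wraps` from the standard library to keep the original name. This is an addition to the answer, not something the answer itself proposes.

```python
import functools

import numpy as np
import pandas as pd


def assert_argtype(clazz):
    def wrapping(f):
        @functools.wraps(f)  # copy __name__, __doc__, etc. from f onto wrapper
        def wrapper(s):
            if not isinstance(s, clazz):
                raise ValueError('needs %s argument' % clazz)
            return f(s)
        return wrapper
    return wrapping


@assert_argtype(pd.Series)
def nok_mean(x):
    return np.mean(x)


s = pd.Series([3 * x for x in range(10)])
print(nok_mean.__name__)  # the decorated function keeps its own name
print(nok_mean(s))        # called with a Series, it aggregates normally
```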
Answer 2:
Based on the answers to the question "Pandas - DataFrame aggregate behaving oddly", it looks like this happens because np.mean is being called directly on individual values rather than across the entire Series in the dataframe. Changing the function to
def nok_mean(x):
    return x.mean()
Now allows you to apply multiple functions:
df.agg(['mean', nok_mean])
Returns
             a    b
mean      13.5 -8.0
nok_mean  13.5 -8.0
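The same idea should carry over to the percentile case mentioned in the question: use a Series method (`Series.quantile`) instead of a NumPy function that also accepts scalars. The following is only a sketch under that assumption; `make_percentile` and the `p50`/`p90` names are hypothetical, and I have not checked it against every pandas version.

```python
import pandas as pd


def make_percentile(q):
    """Return an aggregator computing the q-quantile (0 <= q <= 1) of a Series."""
    def percentile(s):
        # quantile is a Series method; plain scalars have no such attribute,
        # so an element-wise call cannot succeed silently.
        return s.quantile(q)
    # Give each aggregator a distinct name so the result rows are distinguishable.
    percentile.__name__ = 'p%d' % round(q * 100)
    return percentile


df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]

p50 = make_percentile(0.5)
p90 = make_percentile(0.9)
result = df.agg(['mean', p50, p90])
print(result)
```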
Source: https://stackoverflow.com/questions/54890646/pandas-fails-to-aggregate-with-a-list-of-aggregation-functions