Pandas fails to aggregate with a list of aggregation functions

Question

How do I specify custom aggregating functions so that they behave correctly when used in list arguments of pandas.DataFrame.aggregate?

Given a two-column dataframe in pandas ...

import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['a'] = [ 3 * x for x in range(10) ]
df['b'] = [ 1 -2 * x for x in range(10) ]

... aggregating over a list of aggregation function specs is not a problem:

def ok_mean(x):
  return x.mean()

df.aggregate(['mean', np.max, ok_mean])

            a    b
mean     13.5 -8.0
amax     27.0  1.0
ok_mean  13.5 -8.0

but when the aggregation is a (lambda or named) function that calls np.mean(x) rather than x.mean(), the list form fails to aggregate and returns the values element-wise:

def nok_mean(x):
  return np.mean(x)

df.aggregate([lambda x:  np.mean(x), nok_mean])

         a                 b
  <lambda> nok_mean <lambda> nok_mean
0      0.0      0.0      1.0      1.0
1      3.0      3.0     -1.0     -1.0
2      6.0      6.0     -3.0     -3.0
3      9.0      9.0     -5.0     -5.0
4     12.0     12.0     -7.0     -7.0
...

Mixing aggregating and non-aggregating specs leads to an error:

df.aggregate(['mean', nok_mean])

~/anaconda3/envs/tsa37_jup/lib/python3.7/site-packages/pandas/core/base.py in _aggregate_multiple_funcs(self, arg, _level, _axis)
    607         # if we are empty
    608         if not len(results):
--> 609             raise ValueError("no results")
    610

Using the aggregating function directly (not in a list), however, gives the expected result:

df.aggregate(nok_mean)

a    13.5
b    -8.0
dtype: float64

Is this a bug, or am I missing something in the way that I define aggregation functions? In my real project, I'm using more complex aggregation functions (such as this percentile one). So my question is:

How do I specify a custom aggregating function in order to work around this bug?

Note that using the custom aggregating function over a rolling, expanding or group-by window gives the expected result:

df.expanding().aggregate(['mean', nok_mean])
## returns cumulative aggregation results as expected

Pandas version: 0.23.4

Answer 1:

I found that making the aggregating function fail when it is called with a non-Series argument is a workaround:

def ok_mean(x):
  return np.mean(x.values)

def ok_mean2(x):
  if not isinstance(x,pd.Series):
    raise ValueError('need Series argument')
  return np.mean(x)

df.aggregate(['mean', ok_mean, ok_mean2])

It seems that in this situation (a list argument to pandas.DataFrame.aggregate), pandas first tries to apply the aggregating function to each data point and, only once that fails, falls back to the expected behaviour of calling the function with the whole Series to be aggregated.
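The fallback described above can be illustrated with a toy model. This is a simplified sketch, not pandas' actual implementation: the helper name agg_with_fallback and the set of caught exception types are assumptions made for illustration, and newer pandas releases may behave differently.

```python
import numpy as np
import pandas as pd


def agg_with_fallback(series, func):
    """Toy model (hypothetical helper): try func per element, then on the whole Series."""
    try:
        # Element-wise attempt: func sees plain Python scalars.
        values = [func(v) for v in series.tolist()]
        return pd.Series(values, index=series.index)
    except (ValueError, AttributeError, TypeError):
        # The element-wise call failed, so hand func the whole Series instead.
        return func(series)


s = pd.Series([3 * x for x in range(10)])

# np.mean accepts a scalar, so the element-wise attempt "succeeds"
# and nothing is aggregated.
elementwise = agg_with_fallback(s, lambda x: np.mean(x))

# A Python int has no .mean() or .values, so these fall through
# to the call on the whole Series.
reduced_method = agg_with_fallback(s, lambda x: x.mean())
reduced_values = agg_with_fallback(s, lambda x: np.mean(x.values))
```

In this model only functions that raise on a scalar reach the second branch, which is consistent with ok_mean and ok_mean2 aggregating while a plain np.mean(x) does not.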

The same check can be packaged as a decorator that forces Series arguments:

def assert_argtype(clazz):
    def wrapping(f):
        def wrapper(s):
            if not isinstance(s,clazz):
                raise ValueError('needs %s argument' % clazz)
            return f(s)
        return wrapper
    return wrapping

@assert_argtype(pd.Series)
def nok_mean(x):
    return np.mean(x)

df.aggregate([nok_mean])
## OK now, decorator fixed it!
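One caveat with the decorator as written: the decorated function reports its name as wrapper, and pandas labels the result row by the function's name. A possible variant, sketched here with functools.wraps from the standard library, keeps the original name. The dataframe is rebuilt in the snippet only to make it self-contained.

```python
import functools

import numpy as np
import pandas as pd


def assert_argtype(clazz):
    """Decorator factory: reject any argument that is not an instance of clazz."""
    def wrapping(f):
        @functools.wraps(f)  # keep f.__name__ so the row is not labelled 'wrapper'
        def wrapper(s):
            if not isinstance(s, clazz):
                raise ValueError('needs %s argument' % clazz)
            return f(s)
        return wrapper
    return wrapping


@assert_argtype(pd.Series)
def nok_mean(x):
    return np.mean(x)


# Same data as in the question, rebuilt so the snippet runs on its own.
df = pd.DataFrame({'a': [3 * x for x in range(10)],
                   'b': [1 - 2 * x for x in range(10)]})
result = df.aggregate([nok_mean])
```

With the name preserved, the single result row is labelled nok_mean.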

Answer 2:

This is based on the answers to this question: Pandas - DataFrame aggregate behaving oddly

It looks like this happens because np.mean ends up being called directly on individual values rather than across the entire Series in the dataframe. Changing the function to

def nok_mean(x):
    return x.mean()

now allows you to apply multiple functions:

df.agg(['mean', nok_mean])

This returns:

             a    b
mean      13.5 -8.0
nok_mean  13.5 -8.0
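The same approach should extend to the percentile aggregation mentioned in the question. The sketch below is illustrative only: percentile is a hypothetical factory, not a pandas API, and it relies on Series.quantile so that a scalar argument raises AttributeError instead of being silently accepted.

```python
import pandas as pd


def percentile(n):
    """Hypothetical factory: build an aggregator returning the n-th percentile."""
    def percentile_(x):
        # Series.quantile expects a fraction in [0, 1]; plain scalars have no .quantile.
        return x.quantile(n / 100)
    percentile_.__name__ = 'percentile_%d' % n  # label used for the result row
    return percentile_


# Same data as in the question, rebuilt so the snippet runs on its own.
df = pd.DataFrame({'a': [3 * x for x in range(10)],
                   'b': [1 - 2 * x for x in range(10)]})
result = df.agg(['mean', percentile(50)])
```

Setting __name__ on the inner function gives each percentile its own row label, so several of them can be passed in the same list.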

Source: https://stackoverflow.com/questions/54890646/pandas-fails-to-aggregate-with-a-list-of-aggregation-functions

Tags

python

pandas

aggregate