Pandas - DataFrame aggregate behaving oddly

Question

Related to "Dataframe aggregate method passing list problem" and "Pandas fails to aggregate with a list of aggregation functions".

Consider this dataframe:

import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]

According to the documentation for aggregate you should be able to specify which columns to aggregate using a dict like this:

df.agg({'a' : 'mean'})

Which returns

a    13.5

But if you try to aggregate with a user-defined function like this one

def nok_mean(x):
    return np.mean(x)

df.agg({'a' : nok_mean})

it returns one value per row rather than a single mean for the column.

Why does the user-defined function not return the same as aggregating with np.mean or 'mean'?

This is using pandas version 0.23.4, numpy version 1.15.4, python version 3.7.1

Answer 1:

The issue has to do with applying np.mean to a series. Let's look at a few examples:

def nok_mean(x):
    return x.mean()

df.agg({'a': nok_mean})

a    13.5
dtype: float64

This works as expected because you are using the pandas version of mean, which can be applied to either a Series or a DataFrame:

df['a'].agg(nok_mean)
df.apply(nok_mean)

Let's see what happens when np.mean is applied to a series:

def nok_mean1(x):
    return np.mean(x)

df['a'].agg(nok_mean1)
df.agg({'a':nok_mean1})
df['a'].apply(nok_mean1)
df['a'].apply(np.mean)

These all return:

0     0.0
1     3.0
2     6.0
3     9.0
4    12.0
5    15.0
6    18.0
7    21.0
8    24.0
9    27.0
Name: a, dtype: float64
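One way to see why this output looks like the input column: if the function is called once per element, np.mean of a single number is just that number as a float. The sketch below reproduces both cases by hand so it does not depend on how a particular pandas version dispatches agg or apply. The names scalar_mean, per_element and whole_column are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [3 * x for x in range(10)]})

# np.mean of a single number returns that number as a float.
scalar_mean = float(np.mean(3))

# Calling np.mean once per element reproduces the "row-wise" output.
per_element = [float(np.mean(v)) for v in df["a"]]

# Calling np.mean once on the whole column gives the aggregate.
whole_column = float(np.mean(df["a"]))

print(scalar_mean)
print(per_element)
print(whole_column)
```

So the function itself is fine; what changes the result is whether it receives the whole column or one element at a time.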

When you apply np.mean to a DataFrame, it works as expected:

df.agg(nok_mean1)
df.apply(nok_mean1)

a    13.5
b    -8.0
dtype: float64

To get np.mean to work as expected inside a user-defined function, pass it an ndarray by using x.values:

def nok_mean2(x):
    return np.mean(x.values)

df.agg({'a':nok_mean2})

a    13.5
dtype: float64

My guess is that all of this comes down to apply, which is also why df['a'].apply(nok_mean2) raises an AttributeError.
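The AttributeError is consistent with Series.apply handing the function individual elements rather than the Series. A minimal sketch of that, assuming Series.apply calls a plain Python function once per element (the flag raised and the variable direct are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [3 * x for x in range(10)]})

def nok_mean2(x):
    return np.mean(x.values)

# Each element is a plain number with no .values attribute,
# so the element-wise call fails.
try:
    df["a"].apply(nok_mean2)
    raised = False
except AttributeError:
    raised = True

# Calling the function directly on the column works,
# because a Series does have .values.
direct = float(nok_mean2(df["a"]))

print(raised, direct)
```

That failure on scalars would also explain why nok_mean2 aggregates correctly inside agg: the element-wise attempt cannot succeed, so only the call on the whole column can produce a result.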

I am guessing here in the source code

Answer 2:

When you define your nok_mean function, the definition is effectively saying that you want np.mean for each row.

It finds the mean for each row and returns you the result. For example, if your dataframe looked like this:

    a           b
0   [0, 0]      1
1   [3, 4]      -1
2   [6, 8]      -3
3   [9, 12]     -5
4   [12, 16]    -7
5   [15, 20]    -9
6   [18, 24]    -11
7   [21, 28]    -13
8   [24, 32]    -15
9   [27, 36]    -17

Then df.agg({'a': nok_mean}) would return the mean of each list in column a, one value per row (0.0 for [0, 0], 3.5 for [3, 4], and so on).

Answer 3:

This is related to how pandas performs the calculations.

When you pass a dict of functions, the input is treated as a DataFrame instead of a flattened array. After that all calculations are made over the index axis by default. That's why you're getting the means by row.

If you go to the docs page you'll see:

The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
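The difference described in that passage can be checked with NumPy alone. This is a small sketch using the same values as columns a and b from the question; arr_2d and the result names are only illustrative.

```python
import numpy as np

# Same values as columns 'a' and 'b' in the question, as a 10x2 array.
arr_2d = np.column_stack([
    [3 * x for x in range(10)],      # column 'a'
    [1 - 2 * x for x in range(10)],  # column 'b'
])

flat_mean = np.mean(arr_2d)          # one number for the flattened array
col_means = np.mean(arr_2d, axis=0)  # one number per column
row_means = np.mean(arr_2d, axis=1)  # one number per row

print(flat_mean)
print(col_means)
print(row_means)
```

Only the axis=0 result matches what 'mean' gives per column in pandas (13.5 and -8.0).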

I think the only way to emulate numpy's behavior and pass a dict of functions to agg at the same time is df.agg(nok_mean)['a'].
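A short sketch of that workaround, together with a variant that selects the column first so that the other columns are not aggregated and then discarded. The variant is my own assumption of an equivalent spelling, not something from the docs; all_means, a_mean and a_only are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(10))
df["a"] = [3 * x for x in range(10)]
df["b"] = [1 - 2 * x for x in range(10)]

def nok_mean(x):
    return np.mean(x)

# With a bare callable, each column is passed to the function as a Series.
all_means = df.agg(nok_mean)
a_mean = float(all_means["a"])

# Variant: restrict to the wanted column(s) first, then aggregate.
a_only = df[["a"]].agg(nok_mean)

print(a_mean)
print(a_only)
```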

Source: https://stackoverflow.com/questions/54892806/pandas-dataframe-aggregate-behaving-oddly

Tags: pandas, numpy, dataframe, aggregate, series