Pandas Groupby: How to use two lambda functions?

Question

I can currently do the following in Pandas, but I get a stern finger wagging from FutureWarning:

grpd = df.groupby("rank").agg({
    "mean": np.mean, "median": np.median, "min": np.min, "max": np.max,
    "25th percentile": lambda x: np.percentile(x, 25),
    "75th percentile": lambda x: np.percentile(x, 75)
})

The following throws an error because I have two lambda functions:

percentile_25 = lambda x: np.percentile(x, 25)
percentile_75 = lambda x: np.percentile(x, 75)

df = diffs[["User Installs", "rank"]].dropna()
grpd = df.groupby("rank").agg([
    np.mean, np.median, np.min, np.max, 
    percentile_25, percentile_75
])

This throws:

SpecificationError: Function names must be unique, found multiple named <lambda>

The only way I can seem to make this work (without ignoring the warning, which I should probably just do) is an elaborate process like the following:

1. Define my DF with one lambda function (25th percentile) and everything else I need (min, max, etc.).
2. Rename the columns to get rid of the MultiIndex.
3. Create ANOTHER DF and do ANOTHER grouping, this time with the other column I want (75th percentile).
4. Rename the columns again (thanks, MultiIndex!).
5. Join back to the original DF on the index.

Is there something I'm missing here? Surely there's a better way to do what I imagine is a pretty common thing (using two aggregations that aren't directly importable from numpy).

Answer 1:

This is a known bug. Use named functions instead of lambdas:

def percentile_25(x): return np.percentile(x, 25)
def percentile_75(x): return np.percentile(x, 75)
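
As a minimal sketch of how these named functions plug into the aggregation, assuming a small made-up DataFrame (the values are hypothetical; only the column names come from the question). The built-in aggregations are passed as strings here, which is an assumption of this sketch rather than something the question does:

```python
import numpy as np
import pandas as pd

def percentile_25(x):
    return np.percentile(x, 25)

def percentile_75(x):
    return np.percentile(x, 75)

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "rank": [1, 1, 1, 1, 2, 2, 2, 2],
    "User Installs": [10, 20, 30, 40, 100, 200, 300, 400],
})

# Each function has a distinct __name__, so the output columns do not collide.
grpd = df.groupby("rank")["User Installs"].agg(
    ["mean", "median", "min", "max", percentile_25, percentile_75]
)
print(grpd)
```

Selecting the single column before calling agg() also keeps the result columns flat instead of producing a MultiIndex.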

Answer 2:

Try the following little hack:

percentile_25 = lambda x: np.percentile(x, 25)
percentile_25.__name__ = 'percentile_25'
percentile_75 = lambda x: np.percentile(x, 75)
percentile_75.__name__ = 'percentile_75'
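
To see the effect of the hack, here is a minimal sketch on made-up data (the DataFrame values are hypothetical; the column names are borrowed from the question). pandas labels each output column with the function's `__name__`, so once the names differ the two lambdas no longer collide:

```python
import numpy as np
import pandas as pd

percentile_25 = lambda x: np.percentile(x, 25)
percentile_25.__name__ = "percentile_25"
percentile_75 = lambda x: np.percentile(x, 75)
percentile_75.__name__ = "percentile_75"

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "rank": [1, 1, 2, 2],
    "User Installs": [10, 30, 100, 300],
})

grpd = df.groupby("rank")["User Installs"].agg([percentile_25, percentile_75])
print(grpd.columns.tolist())
```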

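Answer 3:

The problem is the name of the result column: every lambda is named <lambda>, so the generated column names collide.

An alternative is to pass a nested dict that maps output names to functions (note that pandas 0.20 deprecated this renaming form and pandas 1.0 removed it):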

percentile_25 = lambda x: np.percentile(x, 25)
percentile_75 = lambda x: np.percentile(x, 75)

grouped = df.groupby("field1")
grouped.agg({
    'field2': {'percentile_25': percentile_25, 'percentile_75': percentile_75}
})
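
On newer pandas versions, the same renaming can be written with keyword-based named aggregation. The sketch below assumes a recent pandas release (named aggregation was introduced in 0.25; lambdas inside it are assumed to need 1.0 or later) and uses made-up data; `field1` and `field2` are the placeholder names from the snippet above:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "field1": ["a", "a", "a", "b", "b", "b"],
    "field2": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Each keyword is the output column name; the tuple is (input column, function).
result = df.groupby("field1").agg(
    percentile_25=("field2", lambda x: np.percentile(x, 25)),
    percentile_75=("field2", lambda x: np.percentile(x, 75)),
)
print(result)
```

Because the output names are given explicitly, the lambdas do not need unique `__name__` attributes, and the result has flat columns.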

Answer 4:

Here is another approach, similar to MaxU's, that lets you create an arbitrary number of lambda functions. For example, to compute every 10th percentile:

n_percentile_groups = 10
lambda_list = []

for pcntl in np.linspace(10, 100, n_percentile_groups):
    lmbd = lambda x, pcntl=pcntl: np.percentile(x, int(pcntl))
    lmbd.__name__ = 'percentile_%d' % pcntl
    lambda_list.append(lmbd)

Now pass lambda_list to groupby.agg(), or concatenate it with a list of other functions, e.g. lambda_list + [np.mean, np.min, ...].

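If you instead want just 5 different percentiles, set n_percentile_groups = 5. With the linspace call above, that yields the 10th, 32nd, 55th, 77th and 100th percentiles, because int() truncates 32.5 and 77.5.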

Ultimately, I am not sure whether this is a robust or good way of handling a variable number of lambdas, but since the groupby deprecation (0.21) it is the only way I know. Comments on this are very welcome.
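
One possible variation, sketched here under stated assumptions, swaps the default-argument trick for a small factory function. `make_percentile` is a hypothetical helper (not part of pandas or NumPy), and the data is made up:

```python
import numpy as np
import pandas as pd

def make_percentile(q):
    """Return an aggregation function for the q-th percentile, with a unique name."""
    def percentile(x):
        return np.percentile(x, q)
    # pandas uses __name__ as the output column label.
    percentile.__name__ = "percentile_%d" % q
    return percentile

funcs = [make_percentile(q) for q in (25, 50, 75)]

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "rank": [1, 1, 1, 1, 1],
    "User Installs": [0, 10, 20, 30, 40],
})

# The generated functions can be mixed with built-in aggregations.
grpd = df.groupby("rank")["User Installs"].agg(funcs + ["mean", "min"])
print(grpd)
```

The closure captures `q` per call, so there is no late-binding problem and no need for the `pcntl=pcntl` default argument.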

Source: https://stackoverflow.com/questions/46813921/pandas-groupby-how-to-use-two-lambda-functions

Tags

python

pandas

pandas-groupby