Question
I can currently do the following in Pandas, but I get a stern finger wagging from FutureWarning:
grpd = df.groupby("rank").agg({
    "mean": np.mean, "median": np.median, "min": np.min, "max": np.max,
    "25th percentile": lambda x: np.percentile(x, 25),
    "75th percentile": lambda x: np.percentile(x, 75)
})
The following throws an error because I have two lambda functions:
percentile_25 = lambda x: np.percentile(x, 25)
percentile_75 = lambda x: np.percentile(x, 75)
df = diffs[["User Installs", "rank"]].dropna()
grpd = df.groupby("rank").agg([
    np.mean, np.median, np.min, np.max,
    percentile_25, percentile_75
])
This throws:
SpecificationError: Function names must be unique, found multiple named <lambda>
The only way I can seem to make this work (without ignoring the warning, which I should probably just do) is with an elaborate process like the following:
- Define my DF with one lambda function (25th percentile), and everything else I need (min, max, etc.)
- Rename the cols to get rid of the MultiIndex
- Create ANOTHER DF, do ANOTHER grouping, this time with the other column I want (75th percentile)
- Rename cols again (thanks MultiIndex!)
- Join back to the original DF on the index
Is there something I'm missing here? Surely there's a better way to do what I imagine is a pretty common thing (using two aggregations that aren't directly importable from numpy).
Answer 1:
It's a known bug. Use named functions instead of lambdas:
def percentile_25(x):
    return np.percentile(x, 25)

def percentile_75(x):
    return np.percentile(x, 75)
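To show the named functions in context, here is a minimal sketch. The DataFrame is made-up data that stands in for the question's diffs[["User Installs", "rank"]], and the built-in aggregations are passed as strings, which is an assumption on my part rather than something the answer specifies.

```python
import numpy as np
import pandas as pd


def percentile_25(x):
    return np.percentile(x, 25)


def percentile_75(x):
    return np.percentile(x, 75)


# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "rank": [1, 1, 1, 1, 2, 2, 2, 2],
    "User Installs": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0, 300.0, 400.0],
})

# Selecting a single column first keeps the result columns flat (no MultiIndex).
grpd = df.groupby("rank")["User Installs"].agg(
    ["mean", "median", "min", "max", percentile_25, percentile_75]
)
print(grpd)
```

Each result column is labelled with the function's name, so the two percentile columns no longer collide.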
Answer 2:
Try the following little hack:
percentile_25 = lambda x: np.percentile(x, 25)
percentile_25.__name__ = 'percentile_25'
percentile_75 = lambda x: np.percentile(x, 75)
percentile_75.__name__ = 'percentile_75'
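A quick way to check that the renamed lambdas behave as hoped is to run them through agg on a small frame. This is only a sketch: the data is invented, and selecting the "User Installs" column before aggregating is my assumption about how the result would be used.

```python
import numpy as np
import pandas as pd

percentile_25 = lambda x: np.percentile(x, 25)
percentile_25.__name__ = "percentile_25"
percentile_75 = lambda x: np.percentile(x, 75)
percentile_75.__name__ = "percentile_75"

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "rank": [1, 1, 2, 2],
    "User Installs": [10.0, 30.0, 100.0, 300.0],
})

# The overridden __name__ values become the result column labels.
result = df.groupby("rank")["User Installs"].agg([percentile_25, percentile_75])
print(result)
```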
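Answer 3:
The problem is the name of the result column: pandas labels each output column with the function's name, and both lambdas are named <lambda>.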
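The naming collision can be seen without pandas at all. The short sketch below just inspects the __name__ attribute; named_percentile_25 is an illustrative name, not something from the question.

```python
import numpy as np

percentile_25 = lambda x: np.percentile(x, 25)
percentile_75 = lambda x: np.percentile(x, 75)

# Assigning a lambda to a variable does not change its __name__.
lambda_names = [percentile_25.__name__, percentile_75.__name__]
print(lambda_names)  # ['<lambda>', '<lambda>']


def named_percentile_25(x):
    return np.percentile(x, 25)


# A def statement records the function's own name instead.
print(named_percentile_25.__name__)  # named_percentile_25
```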
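An alternative is to supply the column names yourself through a nested dict. Note that this is the renaming form that was deprecated in pandas 0.20 and removed in pandas 1.0, so it only works on older versions:
percentile_25 = lambda x: np.percentile(x, 25)
percentile_75 = lambda x: np.percentile(x, 75)
grouped = df.groupby("field1")
grouped.agg({
    'field2': {'percentile_25': percentile_25, 'percentile_75': percentile_75}
})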
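On newer pandas the same idea, naming each output column explicitly, can be written with named aggregation instead of a nested dict. This is a different technique from the one in the answer, sketched here on the assumption that pandas 1.0 or later is installed; the data is invented and only field1 and field2 come from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the answer's column names.
df = pd.DataFrame({
    "field1": ["a", "a", "a", "b", "b", "b"],
    "field2": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Each keyword is an output column name; the tuple is (input column, function).
result = df.groupby("field1").agg(
    percentile_25=("field2", lambda x: np.percentile(x, 25)),
    percentile_75=("field2", lambda x: np.percentile(x, 75)),
)
print(result)
```

Because the output names are given as keywords, the lambdas never need names of their own.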
Answer 4:
Here is another way, similar to MaxU's answer, that lets you create an arbitrary number of lambda functions. For example, if we want every 10th percentile, we can do the following:
n_percentile_groups = 10
lambda_list = []
for pcntl in np.linspace(10, 100, n_percentile_groups):
    lmbd = lambda x, pcntl=pcntl: np.percentile(x, int(pcntl))
    lmbd.__name__ = 'percentile_%d' % pcntl
    lambda_list.append(lmbd)
Now pass lambda_list to groupby.agg(), or concatenate it with a list of other functions, e.g. lambda_list + [np.mean, np.min, ...].
If you instead wanted just 5 different percentiles, you could set n_percentile_groups = 5.
Ultimately, I am not sure whether this is a robust or good way of handling a variable number of lambdas, but since the groupby deprecation in 0.21 it is the only way I know. Comments on this are very welcome.
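One possible variation on the loop above is a small factory function, which puts the percentile value in a closure rather than a default argument. This is only a sketch: make_percentile is a hypothetical helper, the data is invented, and "mean" is passed as a string by assumption.

```python
import numpy as np
import pandas as pd


def make_percentile(q):
    """Return an aggregator for the q-th percentile, named percentile_<q>."""
    def percentile(x):
        return np.percentile(x, q)
    # Give each generated function a distinct name for the result column.
    percentile.__name__ = "percentile_%d" % q
    return percentile


# One aggregator per decile: 10, 20, ..., 100.
percentile_funcs = [make_percentile(q) for q in range(10, 101, 10)]

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "rank": [1] * 5 + [2] * 5,
    "User Installs": [0.0, 10.0, 20.0, 30.0, 40.0,
                      0.0, 100.0, 200.0, 300.0, 400.0],
})

result = df.groupby("rank")["User Installs"].agg(["mean"] + percentile_funcs)
print(result)
```

Each call to make_percentile creates its own scope, so every generated function keeps its own q without relying on the pcntl=pcntl default-argument trick.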
Source: https://stackoverflow.com/questions/46813921/pandas-groupby-how-to-use-two-lambda-functions