Pandas Groupby: How to use two lambda functions?

Submitted by 风流意气都作罢 on 2021-01-27 20:13:53

Question


I can currently do the following in Pandas, but I get a stern finger wagging from FutureWarning:

grpd = df.groupby("rank").agg({
    "mean": np.mean, "median": np.median, "min": np.min, "max": np.max,
    "25th percentile": lambda x: np.percentile(x, 25),
    "75th percentile": lambda x: np.percentile(x, 75)
})

The following throws an error because I have two lambda functions:

percentile_25 = lambda x: np.percentile(x, 25)
percentile_75 = lambda x: np.percentile(x, 75)

df = diffs[["User Installs", "rank"]].dropna()
grpd = df.groupby("rank").agg([
    np.mean, np.median, np.min, np.max, 
    percentile_25, percentile_75
])

This throws:

SpecificationError: Function names must be unique, found multiple named <lambda>

The only way I can seem to make this work (without ignoring the warning, which I should probably just do) is with an elaborate process like the following:

  1. Define my DF with one lambda function (25th percentile), and everything else I need (min, max, etc.)
  2. Rename the cols to get rid of the MultiIndex
  3. Create ANOTHER DF, do ANOTHER grouping, this time with the other column I want (75th percentile)
  4. Rename cols again (thanks MultiIndex!)
  5. Join back to the original DF on the index

Is there something I'm missing here? Surely there's a better way to do what I imagine is a pretty common thing (using two aggregations that aren't directly importable from numpy).


Answer 1:


It's a known bug. Use named functions instead of lambdas, so that each one has a unique name:

def percentile_25(x): return np.percentile(x, 25)
def percentile_75(x): return np.percentile(x, 75)
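
For completeness, here is a minimal sketch of how those two named functions might be passed to agg. The sample DataFrame is made up for illustration; only the column names "rank" and "User Installs" come from the question.

```python
import numpy as np
import pandas as pd

def percentile_25(x):
    return np.percentile(x, 25)

def percentile_75(x):
    return np.percentile(x, 75)

# Hypothetical sample data standing in for the question's DataFrame
df = pd.DataFrame({
    "rank": [1, 1, 1, 2, 2, 2],
    "User Installs": [10, 20, 30, 100, 200, 300],
})

# Each function has its own __name__, so the result columns are unique
grpd = df.groupby("rank")["User Installs"].agg(
    ["mean", "median", "min", "max", percentile_25, percentile_75]
)
print(grpd)
```

The result columns take their labels from the function names, so no renaming step or MultiIndex cleanup should be needed afterwards.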



Answer 2:


Try the following little hack:

percentile_25 = lambda x: np.percentile(x, 25)
percentile_25.__name__ = 'percentile_25'
percentile_75 = lambda x: np.percentile(x, 75)
percentile_75.__name__ = 'percentile_75'
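
A small sketch of why this hack works: every lambda starts out with the same __name__, and overwriting it gives pandas a distinct column label for each function. The sample data below is invented for illustration.

```python
import numpy as np
import pandas as pd

percentile_25 = lambda x: np.percentile(x, 25)
percentile_75 = lambda x: np.percentile(x, 75)

# Every lambda shares this default name, which is the source of the clash
default_name = percentile_25.__name__

percentile_25.__name__ = "percentile_25"
percentile_75.__name__ = "percentile_75"

# Hypothetical sample data
df = pd.DataFrame({
    "rank": [1, 1, 1, 2, 2, 2],
    "User Installs": [10, 20, 30, 100, 200, 300],
})

grpd = df.groupby("rank")["User Installs"].agg(
    ["min", "max", percentile_25, percentile_75]
)
print(default_name)
print(grpd)
```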



Answer 3:


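The problem is the resulting column name: both lambdas are named <lambda>, so the output columns would collide.

An alternative is to supply the column names explicitly: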

percentile_25 = lambda x: np.percentile(x, 25)
percentile_75 = lambda x: np.percentile(x, 75)

grouped = df.groupby("field1")
grouped.agg({
    'field2': {'percentile_25': percentile_25, 'percentile_75': percentile_75}
})
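
Note that this nested-dict form of renaming was deprecated and is rejected by pandas 1.0 and later. On a recent pandas version (named aggregation was added in 0.25), the same idea can be sketched with keyword arguments, where each keyword becomes the output column name. The sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "rank": [1, 1, 1, 2, 2, 2],
    "User Installs": [10, 20, 30, 100, 200, 300],
})

# Named aggregation: keyword = output column name, value = aggregation
grpd = df.groupby("rank")["User Installs"].agg(
    mean="mean",
    median="median",
    percentile_25=lambda x: np.percentile(x, 25),
    percentile_75=lambda x: np.percentile(x, 75),
)
print(grpd)
```

Because the output names come from the keywords rather than from the functions, the two lambdas no longer clash.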



Answer 4:


Here is another way, similar to MaxU's, that lets you create an arbitrary number of lambda functions. For example, if we want every 10th percentile, we can do the following:

n_percentile_groups = 10
lambda_list = []

for pcntl in np.linspace(10, 100, n_percentile_groups):
    lmbd = lambda x, pcntl=pcntl: np.percentile(x, int(pcntl))
    lmbd.__name__ = 'percentile_%d' % pcntl
    lambda_list.append(lmbd)

Now pass lambda_list to groupby.agg(), or concatenate it with other function lists, e.g., lambda_list + [np.mean, np.min, ...].

If you instead wanted just 5 different percentiles then you could change n_percentile_groups = 5.
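
As a variation on the loop above, the same idea can be sketched with a small factory function, which avoids the default-argument trick for capturing the loop variable. The helper name make_percentile and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

def make_percentile(q):
    """Hypothetical helper: build a q-th percentile function named after q."""
    def percentile(x):
        return np.percentile(x, q)
    percentile.__name__ = "percentile_%d" % q
    return percentile

# One function per decile: percentile_10, percentile_20, ..., percentile_100
percentile_funcs = [make_percentile(q) for q in range(10, 101, 10)]

# Hypothetical sample data
df = pd.DataFrame({
    "rank": [1, 1, 1, 2, 2, 2],
    "User Installs": [10, 20, 30, 100, 200, 300],
})

grpd = df.groupby("rank")["User Installs"].agg(["mean", "min"] + percentile_funcs)
print(grpd)
```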

Ultimately, I am not sure whether this is a robust or good way of handling a variable number of lambdas, but since the groupby deprecation (as of pandas 0.21) it is the only way I know. Comments on this are very welcome.



Source: https://stackoverflow.com/questions/46813921/pandas-groupby-how-to-use-two-lambda-functions
