可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I'm having trouble with Pandas' groupby functionality. I've read the documentation, but I can't see to figure out how to apply aggregate functions to multiple columns and have custom names for those columns.
This comes very close, but the data structure returned has nested column headings:
data.groupby("Country").agg( {"column1": {"foo": sum()}, "column2": {"mean": np.mean, "std": np.std}})
(ie. I want to take the mean and std of column2, but return those columns as "mean" and "std")
What am I missing?
回答1:
This will drop the outermost level from the hierarchical column index:
df = data.groupby(...).agg(...) df.columns = df.columns.droplevel(0)
If you'd like to keep the outermost level, you can use the ravel() function on the multi-level column to form new labels:
df.columns = ["_".join(x) for x in df.columns.ravel()]
For example:
import pandas as pd import pandas.rpy.common as com import numpy as np data = com.load_data('Loblolly') print(data.head()) # height age Seed # 1 4.51 3 301 # 15 10.89 5 301 # 29 28.72 10 301 # 43 41.74 15 301 # 57 52.70 20 301 df = data.groupby('Seed').agg( {'age':['sum'], 'height':['mean', 'std']}) print(df.head()) # age height # sum std mean # Seed # 301 78 22.638417 33.246667 # 303 78 23.499706 34.106667 # 305 78 23.927090 35.115000 # 307 78 22.222266 31.328333 # 309 78 23.132574 33.781667 df.columns = df.columns.droplevel(0) print(df.head())
yields
sum std mean Seed 301 78 22.638417 33.246667 303 78 23.499706 34.106667 305 78 23.927090 35.115000 307 78 22.222266 31.328333 309 78 23.132574 33.781667
Alternatively, to keep the first level of the index:
df = data.groupby('Seed').agg( {'age':['sum'], 'height':['mean', 'std']}) df.columns = ["_".join(x) for x in df.columns.ravel()]
yields
age_sum height_std height_mean Seed 301 78 22.638417 33.246667 303 78 23.499706 34.106667 305 78 23.927090 35.115000 307 78 22.222266 31.328333 309 78 23.132574 33.781667
回答2:
The currently accepted answer by unutbu describes are great way of doing this in pandas versions
Series:
FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version
DataFrames:
FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.
# Create a sample data frame df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)}) # ==== SINGLE COLUMN (SERIES) ==== # Syntax soon to be deprecated df.groupby('A').B.agg({'foo': 'count'}) # Recommended replacement syntax df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'}) # ==== MULTI COLUMN ==== # Syntax soon to be deprecated df.groupby('A').agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}}) # Recommended replacement syntax df.groupby('A').agg({'B': 'sum', 'C': 'min'}).rename(columns={'B': 'foo', 'C': 'bar'}) # As the recommended syntax is more verbose, parentheses can # be used to introduce line breaks and increase readability (df.groupby('A') .agg({'B': 'sum', 'C': 'min'}) .rename(columns={'B': 'foo', 'C': 'bar'}) )
Please see the 0.20 changelog for additional details.
Update 2017-01-03 in response to @JunkMechanic's comment.
With the old style dictionary syntax, it was possible to pass multiple lambda
functions to .agg
, since these would be renamed with the key in the passed dictionary:
>>> df.groupby('A').agg({'B': {'min': lambda x: x.min(), 'max': lambda x: x.max()}}) B max min A 1 2 0 2 4 3
Multiple functions can also be passed to a single column as a list:
>>> df.groupby('A').agg({'B': [np.min, np.max]}) B amin amax A 1 0 2 2 3 4
However, this does not work with lambda functions, since they are anonymous and all return
, which causes a name collision:
>>> df.groupby('A').agg({'B': [lambda x: x.min(), lambda x: x.max]}) SpecificationError: Function names must be unique, found multiple named
To avoid the SpecificationError
, named functions can be defined a priori instead of using lambda
. Suitable function names also avoid calling .rename
on the data frame afterwards. These functions can be passed with the same list syntax as above:
>>> def my_min(x): >>> return x.min() >>> def my_max(x): >>> return x.max() >>> df.groupby('A').agg({'B': [my_min, my_max]}) B my_min my_max A 1 0 2 2 3 4
回答3:
If you want to have a behavior similar to JMP, creating column titles that keep all info from the multi index you can use:
newidx = [] for (n1,n2) in df.columns.ravel(): newidx.append("%s-%s" % (n1,n2)) df.columns=newidx
It will change your dataframe from:
I V mean std first V 4200.0 25.499536 31.557133 4200.0 4300.0 25.605662 31.678046 4300.0 4400.0 26.679005 32.919996 4400.0 4500.0 26.786458 32.811633 4500.0
to
I-mean I-std V-first V 4200.0 25.499536 31.557133 4200.0 4300.0 25.605662 31.678046 4300.0 4400.0 26.679005 32.919996 4400.0 4500.0 26.786458 32.811633 4500.0
回答4:
With the inspiration of @Joel Ostblom
For those who already have a workable dictionary for merely aggregation, you can use/modify the following code for the newer version aggregation, separating aggregation and renaming part. Please be aware of the nested dictionary if there are more than 1 item.
def agg_translate_agg_rename(input_agg_dict): agg_dict = {} rename_dict = {} for k, v in input_agg_dict.items(): if len(v) == 1: agg_dict[k] = list(v.values())[0] rename_dict[k] = list(v.keys())[0] else: updated_index = 1 for nested_dict_k, nested_dict_v in v.items(): modified_key = k + "_" + str(updated_index) agg_dict[modified_key] = nested_dict_v rename_dict[modified_key] = nested_dict_k updated_index += 1 return agg_dict, rename_dict one_dict = {"column1": {"foo": 'sum'}, "column2": {"mean": 'mean', "std": 'std'}} agg, rename = agg_translator_aa(one_dict)
We get
agg = {'column1': 'sum', 'column2_1': 'mean', 'column2_2': 'std'} rename = {'column1': 'foo', 'column2_1': 'mean', 'column2_2': 'std'}
Please let me know if there is a smarter way to do it. Thanks.