Naming returned columns in Pandas aggregate function?

Anonymous (unverified), posted 2019-12-03 02:41:02

Question:

I'm having trouble with Pandas' groupby functionality. I've read the documentation, but I can't seem to figure out how to apply aggregate functions to multiple columns and have custom names for those columns.

This comes very close, but the data structure returned has nested column headings:

    data.groupby("Country").agg(
        {"column1": {"foo": sum}, "column2": {"mean": np.mean, "std": np.std}})

(i.e. I want to take the mean and std of column2, but return those columns as "mean" and "std")

What am I missing?

Answer 1:

This will drop the outermost level from the hierarchical column index:

    df = data.groupby(...).agg(...)
    df.columns = df.columns.droplevel(0)

If you'd like to keep the outermost level, you can use the ravel() function on the multi-level column to form new labels:

    df.columns = ["_".join(x) for x in df.columns.ravel()]

For example:

    import pandas as pd
    import pandas.rpy.common as com  # note: pandas.rpy was removed from later pandas releases
    import numpy as np

    data = com.load_data('Loblolly')
    print(data.head())
    #     height  age Seed
    # 1     4.51    3  301
    # 15   10.89    5  301
    # 29   28.72   10  301
    # 43   41.74   15  301
    # 57   52.70   20  301

    df = data.groupby('Seed').agg(
        {'age': ['sum'],
         'height': ['mean', 'std']})
    print(df.head())
    #       age     height
    #       sum        std       mean
    # Seed
    # 301    78  22.638417  33.246667
    # 303    78  23.499706  34.106667
    # 305    78  23.927090  35.115000
    # 307    78  22.222266  31.328333
    # 309    78  23.132574  33.781667

    # Drop the outer level ('age', 'height') and keep only the function names
    df.columns = df.columns.droplevel(0)
    print(df.head())

yields

          sum        std       mean
    Seed
    301    78  22.638417  33.246667
    303    78  23.499706  34.106667
    305    78  23.927090  35.115000
    307    78  22.222266  31.328333
    309    78  23.132574  33.781667

Alternatively, to keep the first level of the index:

    df = data.groupby('Seed').agg(
        {'age': ['sum'],
         'height': ['mean', 'std']})
    df.columns = ["_".join(x) for x in df.columns.ravel()]

yields

          age_sum   height_std  height_mean
    Seed
    301        78    22.638417    33.246667
    303        78    23.499706    34.106667
    305        78    23.927090    35.115000
    307        78    22.222266    31.328333
    309        78    23.132574    33.781667
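The two flattening options above can be tried without the R data set. The following is only a minimal sketch: the toy frame is made up for illustration, and it uses MultiIndex.to_flat_index() (available since pandas 0.24) in place of ravel(), which is an assumption about the installed pandas version rather than part of the original answer.

```python
import pandas as pd

# Hypothetical toy data standing in for the Loblolly data set.
data = pd.DataFrame({
    "Seed": [301, 301, 303, 303],
    "age": [3, 5, 3, 5],
    "height": [4.0, 10.0, 5.0, 11.0],
})

df = data.groupby("Seed").agg({"age": ["sum"], "height": ["mean", "std"]})

# Option 1: drop the outer level, keeping only the function names.
dropped = df.copy()
dropped.columns = df.columns.droplevel(0)

# Option 2: join both levels; each label is a tuple such as ("height", "mean").
flat = df.copy()
flat.columns = ["_".join(label) for label in df.columns.to_flat_index()]

print(list(dropped.columns))
print(list(flat.columns))
```

Dropping the outer level is only safe when the function names are unique across columns; joining the levels avoids that collision.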


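Answer 2:

The currently accepted answer by unutbu describes a great way of doing this in older pandas versions. As of pandas 0.20, however, the dict-based renaming syntax from the question raises a warning that it will be removed in a future version.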

Series:

FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version

DataFrames:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version

According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.

    import pandas as pd

    # Create a sample data frame
    df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                       'B': range(5),
                       'C': range(5)})

    # ==== SINGLE COLUMN (SERIES) ====
    # Syntax soon to be deprecated
    df.groupby('A').B.agg({'foo': 'count'})
    # Recommended replacement syntax
    df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})

    # ==== MULTI COLUMN ====
    # Syntax soon to be deprecated
    df.groupby('A').agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
    # Recommended replacement syntax
    df.groupby('A').agg({'B': 'sum', 'C': 'min'}).rename(columns={'B': 'foo', 'C': 'bar'})
    # As the recommended syntax is more verbose, parentheses can
    # be used to introduce line breaks and increase readability
    (df.groupby('A')
        .agg({'B': 'sum', 'C': 'min'})
        .rename(columns={'B': 'foo', 'C': 'bar'})
    )

Please see the 0.20 changelog for additional details.
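Later pandas releases (0.25 and up) also offer "named aggregation", which names the output columns directly in the call to .agg. This is not part of the answer above; the sketch below assumes such a version is installed and reuses the same sample frame.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2], "B": range(5), "C": range(5)})

# Multi column: keyword = output column name,
# value = (input column, aggregation function).
result = df.groupby("A").agg(foo=("B", "sum"), bar=("C", "min"))

# Single column (Series): keyword = output column name, value = aggregation.
counts = df.groupby("A")["B"].agg(foo="count")

print(result)
print(counts)
```

Because the new names are keyword arguments, they have to be valid Python identifiers unless they are passed via a dictionary unpacked with **.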


Update 2017-01-03 in response to @JunkMechanic's comment.

With the old style dictionary syntax, it was possible to pass multiple lambda functions to .agg, since these would be renamed with the key in the passed dictionary:

    >>> df.groupby('A').agg({'B': {'min': lambda x: x.min(), 'max': lambda x: x.max()}})
        B
      max min
    A
    1   2   0
    2   4   3

Multiple functions can also be passed to a single column as a list:

    >>> df.groupby('A').agg({'B': [np.min, np.max]})
         B
      amin amax
    A
    1    0    2
    2    3    4

However, this does not work with lambda functions, since they are anonymous and all report the name <lambda>, which causes a name collision:

    >>> df.groupby('A').agg({'B': [lambda x: x.min(), lambda x: x.max()]})
    SpecificationError: Function names must be unique, found multiple named <lambda>

To avoid the SpecificationError, named functions can be defined a priori instead of using lambda. Suitable function names also avoid calling .rename on the data frame afterwards. These functions can be passed with the same list syntax as above:

    >>> def my_min(x):
    ...     return x.min()
    ...
    >>> def my_max(x):
    ...     return x.max()
    ...
    >>> df.groupby('A').agg({'B': [my_min, my_max]})
           B
      my_min my_max
    A
    1      0      2
    2      3      4
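As a script rather than an interpreter session, the named-function approach might look like the sketch below. The final droplevel(0) step is borrowed from Answer 1 and is an addition here, to show that well-chosen function names leave nothing to rename.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2], "B": range(5), "C": range(5)})


def my_min(x):
    # The function name becomes the column label.
    return x.min()


def my_max(x):
    return x.max()


result = df.groupby("A").agg({"B": [my_min, my_max]})

# Columns are ("B", "my_min") and ("B", "my_max"); drop the outer "B" level.
result.columns = result.columns.droplevel(0)
print(result)
```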


Answer 3:

If you want behavior similar to JMP, with column titles that keep all the information from the multi-index, you can use:

    newidx = []
    for (n1, n2) in df.columns.ravel():
        newidx.append("%s-%s" % (n1, n2))
    df.columns = newidx

It will change your dataframe from:

            I                       V
            mean        std         first
    V
    4200.0  25.499536   31.557133   4200.0
    4300.0  25.605662   31.678046   4300.0
    4400.0  26.679005   32.919996   4400.0
    4500.0  26.786458   32.811633   4500.0

to

            I-mean      I-std       V-first
    V
    4200.0  25.499536   31.557133   4200.0
    4300.0  25.605662   31.678046   4300.0
    4400.0  26.679005   32.919996   4400.0
    4500.0  26.786458   32.811633   4500.0


Answer 4:

Inspired by @Joel Ostblom's answer.

If you already have a working nested dictionary for aggregation, you can use or adapt the following code for newer pandas versions. It splits the dictionary into an aggregation part and a renaming part. Note how columns with more than one nested item are handled: each item gets its own numbered key.

    def agg_translate_agg_rename(input_agg_dict):
        agg_dict = {}
        rename_dict = {}
        for k, v in input_agg_dict.items():
            if len(v) == 1:
                # Single item: keep the column name as the key
                agg_dict[k] = list(v.values())[0]
                rename_dict[k] = list(v.keys())[0]
            else:
                # Several items: create numbered keys such as "column2_1"
                updated_index = 1
                for nested_dict_k, nested_dict_v in v.items():
                    modified_key = k + "_" + str(updated_index)
                    agg_dict[modified_key] = nested_dict_v
                    rename_dict[modified_key] = nested_dict_k
                    updated_index += 1
        return agg_dict, rename_dict

    one_dict = {"column1": {"foo": 'sum'}, "column2": {"mean": 'mean', "std": 'std'}}
    agg, rename = agg_translate_agg_rename(one_dict)

We get

    agg = {'column1': 'sum', 'column2_1': 'mean', 'column2_2': 'std'}
    rename = {'column1': 'foo', 'column2_1': 'mean', 'column2_2': 'std'}

Please let me know if there is a smarter way to do it. Thanks.
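One possible alternative, sketched here under the assumption that pandas 0.25 or later is available: convert the nested dictionary into keyword arguments for named aggregation, where each keyword is the new column name and each value is an (input column, function) pair. The helper name nested_to_named and the toy frame are hypothetical and not from the answer above.

```python
import pandas as pd


def nested_to_named(nested):
    """Turn {column: {new_name: func}} into {new_name: (column, func)}."""
    return {
        new_name: (column, func)
        for column, spec in nested.items()
        for new_name, func in spec.items()
    }


one_dict = {"column1": {"foo": "sum"}, "column2": {"mean": "mean", "std": "std"}}
named = nested_to_named(one_dict)

# Hypothetical toy data with the column names used in the question.
data = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "column1": [1, 2, 3],
    "column2": [1.0, 3.0, 5.0],
})

# Unpack the mapping so each entry becomes new_name=(column, func).
result = data.groupby("Country").agg(**named)
print(result)
```

This only works when the new names are unique across all columns; a repeated name would silently overwrite an earlier entry in the mapping.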


