Pandas groupby with categories with redundant nan

说谎 2020-12-05 02:26

I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. B

5 Answers

失恋的感觉 (original poster)

2020-12-05 03:30

I was able to get a solution that should work really well. I'll edit my post with a better explanation. But in the meantime, does this work well for you?

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols+['Value']))
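# Convert each grouping column to the categorical dtype.
for col in group_cols:
    df[col] = df[col].astype('category')

# Group on the integer codes behind each categorical column and
# aggregate only the non-grouping columns.
codes = [df[col].values.codes for col in group_cols]
result = df.drop(columns=group_cols).groupby(codes).sum()
# The unnamed index levels come back as level_0, level_1, ...
result = result.reset_index()
level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)
# Map the integer codes back to categorical values.
for col in group_cols:
    result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
result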

The answer to this felt more like a general programming problem than a typical Pandas question. Under the hood, every categorical series is just an array of integer codes that index into a list of categories. I did the groupby on these underlying codes because they don't have the same problem as categorical columns. After that I had to rename the columns, because grouping on unnamed arrays produces index levels called level_0, level_1, and so on. I then used the from_codes constructor to efficiently turn the integer codes back into categorical columns. The result:

Group1  Group2  Group3  Value
A       B       C       54.34
A       B       D       826.74
B       A       A       765.40
B       A       C       514.50
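The relationship between codes and categories described above can be seen on a small series. This is only an illustrative sketch: the series `s` and its values are made up for the example and are not part of the original data. A missing value is stored as the code -1, and from_codes turns it back into NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical example series with one missing value.
s = pd.Series(['A', 'B', np.nan, 'A'], dtype='category')

codes = s.cat.codes.to_numpy()   # integer position of each value in the categories
categories = s.cat.categories    # the distinct labels, stored once

print(codes.tolist())            # [0, 1, -1, 0]  (-1 marks the missing value)
print(list(categories))          # ['A', 'B']

# Rebuild the categorical values from the integer codes.
restored = pd.Categorical.from_codes(codes, categories=categories)
print(restored)
```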

I understand that this isn't exactly what you asked for, but I've wrapped my solution in a small function for people who run into this problem in the future.

def categorical_groupby(df, group_cols, agg_function="sum"):
    "Group by several categorical columns via their integer codes."
    codes = [df[col].values.codes for col in group_cols]
    # Aggregate only the non-grouping columns.
    result = df.drop(columns=group_cols).groupby(codes).agg(agg_function)
    # The unnamed index levels come back as level_0, level_1, ...
    result = result.reset_index()
    level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
    result = result.rename(columns=level_to_column_name)
    # Map the integer codes back to categorical values.
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
    return result

Call it like this:

df.pipe(categorical_groupby,group_cols)
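For context, here is a sketch of what a plain groupby on the categorical columns does with the same data. With observed=False it returns one row for every combination of categories, including combinations that never occur. The built-in observed=True option keeps only the combinations that are present. Whether that option alone covers the original question is an assumption on my part, since the question text is cut off.

```python
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=group_cols + ['Value'])
for col in group_cols:
    df[col] = df[col].astype('category')

# observed=False: one row per combination of categories (2 * 2 * 3).
full = df.groupby(group_cols, observed=False)['Value'].sum()

# observed=True: only the combinations that actually appear in the data.
seen = df.groupby(group_cols, observed=True)['Value'].sum()

print(len(full), len(seen))  # 12 4
```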
