I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. B
I was able to get a solution that should work really well. I'll edit my post with a better explanation. But in the mean time, does this work well for you?
import pandas as pd
group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
['A', 'B', 'D', 61.34],
['B', 'A', 'C', 514.5],
['B', 'A', 'A', 765.4],
['A', 'B', 'D', 765.4]],
columns=(group_cols+['Value']))
for col in group_cols:
df[col] = df[col].astype('category')
result = df.groupby([df[col].values.codes for col in group_cols]).sum()
result = result.reset_index()
level_to_column_name = {f"level_{i}":col for i,col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)
for col in group_cols:
result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
result
So the answer to this felt more like a proper programming than a normal Pandas question. Under the hood, all categorical series are just a bunch of numbers that index into a name of categories. I did a groupby on these underlying numbers because they don't have the same problem as categorical columns. After doing this I had to rename the columns. I then used the from_codes constructor to create efficiently turn the list of integers back into a categorical column.
Group1 Group2 Group3 Value
A B C 54.34
A B D 826.74
B A A 765.40
B A C 514.50
So I understand that this isn't exactly your answer but I've made my solution into a little function for people that have this problem in the future.
def categorical_groupby(df,group_cols,agg_fuction="sum"):
"Does a groupby on a number of categorical columns"
result = df.groupby([df[col].values.codes for col in group_cols]).agg(agg_fuction)
result = result.reset_index()
level_to_column_name = {f"level_{i}":col for i,col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)
for col in group_cols:
result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
return result
call it like this:
df.pipe(categorical_groupby,group_cols)