I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings.
I found the behavior similar to what is documented in the Operations section of the pandas Categorical Data user guide.
In particular, it resembles this example:

    In [121]: cats2 = pd.Categorical(["a","a","b","b"], categories=["a","b","c"])

    In [122]: df2 = pd.DataFrame({"cats":cats2,"B":["c","d","c","d"], "values":[1,2,3,4]})

    In [123]: df2.groupby(["cats","B"]).mean()
    Out[123]:
            values
    cats B
    a    c     1.0
         d     2.0
    b    c     3.0
         d     4.0
    c    c     NaN
         d     NaN

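If the extra rows for the unused category are the problem, the `observed` argument of `groupby` may help. Here is a minimal sketch reusing `cats2` and `df2` from the example above; the names `all_cats` and `seen` are mine, and I select the `values` column explicitly. I pass `observed` in both calls because its default has changed across pandas versions.

```python
import pandas as pd

cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df2 = pd.DataFrame({"cats": cats2, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

# observed=False keeps every category of "cats", including the unused "c",
# so the result has one row per (category, B value) combination.
all_cats = df2.groupby(["cats", "B"], observed=False)["values"].mean()
print(all_cats)

# observed=True keeps only the combinations that actually occur in the data.
seen = df2.groupby(["cats", "B"], observed=True)["values"].mean()
print(seen)
```

With `observed=False` the rows for category `c` hold NaN, as in the documentation example; with `observed=True` they are not produced at all.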
The same section describes the related behavior in Series and groupby, quoted below. There is also a pivot table example at the end of the section.
Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:
Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:
Groupby will also show “unused” categories:
The text and the example are quoted from the Categorical Data section of the pandas documentation.
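The `Series.value_counts()` behavior quoted above can be checked with a short sketch. This is my own illustration, not taken from the documentation: the names `s`, `counts`, and `trimmed` are made up, and dropping unused categories with `Series.cat.remove_unused_categories()` is just one possible way to limit the output to values that actually occur.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"]))

# value_counts() reports every category, so "c" is listed with a count of 0
# even though it never appears in the data.
counts = s.value_counts()
print(counts)

# Removing unused categories first restricts the result to observed values.
trimmed = s.cat.remove_unused_categories().value_counts()
print(trimmed)
```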