I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings.
I found this post while debugging something similar. Very good post, and I really like the inclusion of boundary conditions!
Here's the code that accomplishes the initial goal:
r = df.groupby(group_cols, as_index=False).agg({'Value': ['sum']})
# passing the function in a list keeps the hierarchical (MultiIndex) columns,
# so the flattening step below actually has tuples to join
r.columns = ['_'.join(col).strip('_') for col in r.columns]
The downside of this approach is that it results in a hierarchical column index that you may want to flatten (especially if you compute multiple statistics). I included the flattening of the column index in the code above.
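For the multi-statistic case, here is a minimal sketch of the same pattern on a toy DataFrame (the column names and data are made up for illustration):

import pandas as pd

# hypothetical data: categorical grouping columns plus one numeric column
df = pd.DataFrame({
    'State':  pd.Categorical(['TX', 'TX', 'CA']),
    'County': pd.Categorical(['Travis', 'Travis', 'Alameda']),
    'Value':  [1.0, 2.0, 3.0],
})
group_cols = ['State', 'County']

# several statistics at once -> hierarchical (MultiIndex) columns
r = df.groupby(group_cols, as_index=False).agg({'Value': ['count', 'sum', 'mean', 'std']})

# ('Value', 'sum') becomes 'Value_sum'; the plain group columns just lose a trailing '_'
r.columns = ['_'.join(col).strip('_') for col in r.columns]
print(r)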
I don't know why instance methods:
df.groupby(group_cols).sum()
df.groupby(group_cols).mean()
df.groupby(group_cols).std()
return every possible combination of the categorical levels (the full Cartesian product), while the .agg() method:
df.groupby(group_cols).agg(['count', 'sum', 'mean', 'std'])
ignores the unused level combinations of the groups. That seems inconsistent, but I'm just happy that we can use the .agg() method and not have to worry about a Cartesian combination explosion.
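To make the difference concrete, here is a small sketch with invented toy data; the .agg() behavior shown in the comment reflects the pandas version I was on, and passing observed=True explicitly gives the compact result either way:

import pandas as pd

df = pd.DataFrame({
    'State':  pd.Categorical(['TX', 'TX', 'CA']),
    'County': pd.Categorical(['Travis', 'Travis', 'Alameda']),
    'Value':  [1.0, 2.0, 3.0],
})
group_cols = ['State', 'County']

# instance method: under the observed=False default this has a row for every
# level combination (2 states x 2 counties = 4), even though only 2 occur
full = df.groupby(group_cols).sum()
print(len(full))

# .agg(): in the pandas version I was using, only the observed combinations came back
stats = df.groupby(group_cols).agg(['count', 'sum', 'mean', 'std'])
print(len(stats))

# explicit observed=True keeps the result compact regardless of which method you call
compact = df.groupby(group_cols, observed=True).sum()
print(len(compact))   # 2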
Also, I think it is very common to have a much lower count of observed combinations than the full Cartesian product. Think of all the cases where data has columns like "State", "County", "Zip"... these are all nested variables, and many data sets out there have variables with a high degree of nesting.
In our case the difference between the Cartesian product of the grouping variables and the naturally occurring combinations is over 1000x (and the starting data set has over 1,000,000 rows).
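As a rough illustration of that ratio (entirely made-up column names and data), you can compare the two counts directly:

import numpy as np
import pandas as pd

# hypothetical nested columns: each Zip belongs to one County, each County to one State
df = pd.DataFrame({
    'State':  pd.Categorical(['TX', 'TX', 'CA', 'CA']),
    'County': pd.Categorical(['Travis', 'Travis', 'Alameda', 'Alameda']),
    'Zip':    pd.Categorical(['78701', '78702', '94501', '94502']),
})
group_cols = ['State', 'County', 'Zip']

# size of the full Cartesian product of the category levels
cartesian = np.prod([df[c].cat.categories.size for c in group_cols])

# number of combinations that actually occur in the data
observed = len(df[group_cols].drop_duplicates())

print(cartesian, observed)   # 16 vs 4 in this toy example; over 1000x apart in our real data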
Consequently, I would have voted for making observed=True the default behavior.