Pandas groupby with categories with redundant NaN

说谎 2020-12-05 02:26

I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. B

5 Answers
  •  悲&欢浪女
    2020-12-05 03:06

    I found the behavior similar to what's documented in the operations section of Categorical Data.

    In particular, it is similar to this example:

    In [121]: cats2 = pd.Categorical(["a","a","b","b"], categories=["a","b","c"])
    
    In [122]: df2 = pd.DataFrame({"cats":cats2,"B":["c","d","c","d"], "values":[1,2,3,4]})
    
    In [123]: df2.groupby(["cats","B"]).mean()
    Out[123]: 
            values
    cats B        
    a    c     1.0
         d     2.0
    b    c     3.0
         d     4.0
    c    c     NaN
         d     NaN
    

    The same section also describes the related behavior of Series and groupby, quoted below, and there is a pivot table example at the end of the section.

    Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:

    Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:

    Groupby will also show “unused” categories:

    The wording and the example are quoted from Categorical Data.
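
As a minimal sketch of the behavior quoted above, the snippet below rebuilds the same frame and compares the two settings of the `observed` argument of `DataFrame.groupby`. The variable names (`counts`, `full`, `seen`) are only illustrative, and `observed` is passed explicitly because its default has differed between pandas versions. Whether dropping the unused categories is acceptable depends on what the asker needs downstream.

```python
import pandas as pd

# Same data as the documentation example: category "c" is declared but never used.
cats = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"cats": cats, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

# value_counts() reports every declared category, including the unused "c".
counts = df["cats"].value_counts()

# observed=False keeps the unused category, so the "c" rows come back as NaN.
full = df.groupby(["cats", "B"], observed=False).mean()

# observed=True keeps only the category combinations that occur in the data.
seen = df.groupby(["cats", "B"], observed=True).mean()

print(counts)
print(full)
print(seen)
```

With `observed=True` the result should contain only the four combinations present in the data, so the redundant NaN rows for `c` do not appear.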
