Pandas groupby with categories with redundant nan

说谎 2020-12-05 02:26

I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. B

5 Answers
  •  失恋的感觉
    2020-12-05 03:30

    I was able to get a solution that should work really well. I'll edit my post with a better explanation. But in the meantime, does this work well for you?

    import pandas as pd
    
    group_cols = ['Group1', 'Group2', 'Group3']
    
    df = pd.DataFrame([['A', 'B', 'C', 54.34],
                       ['A', 'B', 'D', 61.34],
                       ['B', 'A', 'C', 514.5],
                       ['B', 'A', 'A', 765.4],
                       ['A', 'B', 'D', 765.4]],
                      columns=(group_cols+['Value']))
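    # Convert the grouping columns to the category dtype.
    for col in group_cols:
        df[col] = df[col].astype('category')
    
    # Group on the integer codes behind each categorical column, so only
    # combinations that actually occur in the data produce a row.
    # numeric_only=True keeps the categorical columns themselves out of the sum.
    result = df.groupby([df[col].values.codes for col in group_cols]).sum(numeric_only=True)
    # The unnamed index levels come back as level_0, level_1, ...; rename them.
    result = result.reset_index()
    level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
    result = result.rename(columns=level_to_column_name)
    # Turn the integer codes back into categorical columns.
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
    result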
    

    The answer to this felt more like a general programming problem than a typical pandas question. Under the hood, every categorical series is just an array of integer codes that index into a list of category names. I did the groupby on these underlying codes because they don't have the same problem as categorical columns. After that I had to rename the columns. I then used the from_codes constructor to efficiently turn the integer codes back into categorical columns.

    Group1  Group2  Group3  Value
    A       B       C       54.34
    A       B       D       826.74
    B       A       A       765.40
    B       A       C       514.50
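
The codes-and-categories round trip described above can be seen on a single column. This is only a minimal sketch: the sample series `s` is made up for illustration and is not part of the data in the post.

```python
import pandas as pd

# Hypothetical sample column, not taken from the post's data.
s = pd.Series(["B", "A", "B", "C"], dtype="category")

codes = s.cat.codes            # integer position of each value in the category list
categories = s.cat.categories  # the lookup table of labels

print(list(codes))        # [1, 0, 1, 2]
print(list(categories))   # ['A', 'B', 'C']

# Rebuild the categorical values from the integer codes alone.
rebuilt = pd.Categorical.from_codes(codes, categories=categories)
print(list(rebuilt))      # ['B', 'A', 'B', 'C']
```

Grouping on `codes` rather than on `s` itself means pandas sees plain integers, which is why only combinations present in the data appear in the result.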
    

    I understand that this isn't exactly what you asked for, but I've turned my solution into a small function for people who run into this problem in the future.

    def categorical_groupby(df, group_cols, agg_function="sum"):
        """Group by several categorical columns via their integer codes."""
        codes = [df[col].values.codes for col in group_cols]
        # Drop the grouping columns first so they are not aggregated themselves.
        result = df.drop(columns=group_cols).groupby(codes).agg(agg_function)
        result = result.reset_index()
        level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
        result = result.rename(columns=level_to_column_name)
        # Rebuild each categorical column from its codes.
        for col in group_cols:
            result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
        return result
    

    Call it like this:

    df.pipe(categorical_groupby, group_cols)
    
