Have Pandas column containing lists, how to pivot unique list elements to columns?

旧时难觅i 2021-02-07 22:39

I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma separated string of attr

5 Answers
  •  野的像风
    2021-02-07 23:34

    Here is my crack at a solution extended from a problem I was already working on.

    def group_agg_pivot_df(df, group_cols, agg_func='count', agg_col=None):
    
        if agg_col is None:
            agg_col = group_cols[0]
    
        grouped = df.groupby(group_cols).agg({agg_col: agg_func}) \
            .unstack().fillna(0)
        # drop aggregation column name from hierarchical column names
        grouped.columns = grouped.columns.droplevel()
    
        # promote index to column (the first element of group_cols)
        pivot_df = grouped.reset_index()
        pivot_df.columns = [s.replace(' ', '_').lower() for s in pivot_df.columns]
        return pivot_df
    
    def split_stack_df(df, id_cols, split_col, new_col_name):
        # id_cols are the columns we want to pair with the values
        # from the split column
    
        stacked = df.set_index(id_cols)[split_col].str.split(',', expand=True) \
            .stack().reset_index(level=id_cols)
        stacked.columns = id_cols + [new_col_name]
        return stacked
    
    stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc')
    final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])
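
For readers without the original data, here is a minimal, self-contained sketch of the same split, stack, group, and unstack idea on a tiny made-up frame. The sample rows and the helper name `split_count_pivot` are hypothetical. It also counts with `groupby(...).size()` instead of `agg({col: 'count'})`, which is a substitution on my part and not the code above, and it strips whitespace around each element on the assumption that the strings may contain spaces after the commas.

```python
import pandas as pd

# Hypothetical sample data; the real frame from the question is not shown here.
df = pd.DataFrame({
    'PRODUCTS': ['A', 'B'],
    'DATE': ['2021-01-01', '2021-01-02'],
    'DESCRIPTION': ['red,small', 'red,large,large'],
})


def split_count_pivot(df, id_cols, split_col):
    # One row per (id_cols, list element): split the string, then stack.
    stacked = (
        df.set_index(id_cols)[split_col]
        .str.split(',', expand=True)
        .stack()
        .dropna()          # drop padding left by rows with fewer elements
        .str.strip()       # assumption: elements may carry stray spaces
        .rename('desc')
        .reset_index(level=id_cols)
    )
    # Count occurrences per group, then move the 'desc' values into columns.
    counts = stacked.groupby(id_cols + ['desc']).size().unstack(fill_value=0)
    return counts.reset_index()


result = split_count_pivot(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION')
print(result)
```

Each distinct list element becomes its own column, and each cell holds how many times that element appears for the given product and date.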
    

    I also benchmarked @MaxU's solution, @piRSquared's solution, and mine on a pandas DataFrame with 11,592 rows and a list column containing 2,681 unique values. The column names in the test DataFrame are different, but I have kept the names from the question here.

    Here are the benchmarks for each method:

    In [277]: %timeit pd.get_dummies(df.set_index(['PRODUCTS', 'DATE']) \
     ...:                        .DESCRIPTION.str.split(',', expand=True) \
     ...:                        .stack()) \
     ...:     .groupby(['PRODUCTS', 'DATE']).sum()
     ...: 
    

    1 loop, best of 3: 1.14 s per loop

    In [278]: %timeit df.set_index(['PRODUCTS', 'DATE']) \
     ...:     .DESCRIPTION.str.split(',', expand=True) \
     ...:     .stack() \
     ...:     .reset_index() \
     ...:     .pivot_table(index=['PRODUCTS', 'DATE'], columns=0, fill_value=0, aggfunc='size')
    

    1 loop, best of 3: 612 ms per loop

    In [286]: %timeit stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc'); \
     ...:     final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])
    

    1 loop, best of 3: 62.7 ms per loop

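    On this data, the groupby/unstack version ran roughly 10x faster than the pivot_table() version and roughly 18x faster than the pd.get_dummies() version. My guess is that grouping, aggregating, and unstacking does less work than either pivot_table() or pd.get_dummies().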
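
To try a comparison like this outside IPython, something along these lines may work. It is only a sketch: the synthetic data, the sizes, and the function names are all made up for illustration, the groupby variant uses `size()` rather than the helper functions above, and timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical sizes and tag names).
rng = np.random.default_rng(0)
tags = [f"t{i}" for i in range(50)]
n = 500
df = pd.DataFrame({
    'PRODUCTS': [f"p{i}" for i in range(n)],
    'DATE': '2021-01-01',
    'DESCRIPTION': [','.join(rng.choice(tags, size=3)) for _ in range(n)],
})
id_cols = ['PRODUCTS', 'DATE']


def stacked_series(df):
    # One list element per row, indexed by the id columns.
    return (df.set_index(id_cols).DESCRIPTION
            .str.split(',', expand=True).stack().dropna())


def via_get_dummies(df):
    # Indicator column per element, summed within each product/date.
    return pd.get_dummies(stacked_series(df)).groupby(id_cols).sum()


def via_groupby_unstack(df):
    # Count per (product, date, element), then pivot elements to columns.
    s = stacked_series(df).rename('desc').reset_index(level=id_cols)
    return s.groupby(id_cols + ['desc']).size().unstack(fill_value=0)


a = via_get_dummies(df).astype(int)
b = via_groupby_unstack(df).reindex(index=a.index, columns=a.columns)
same = bool((a.to_numpy() == b.to_numpy()).all())
print("same counts:", same)

for fn in (via_get_dummies, via_groupby_unstack):
    seconds = timeit.timeit(lambda: fn(df), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```

Checking that both variants return the same counts before timing them helps make sure the comparison is between equivalent results.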
