Have Pandas column containing lists, how to pivot unique list elements to columns?

前端未结

关注

 5  828

旧时难觅i 2021-02-07 22:39

I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma separated string of attr

5条回答

野的像风 (楼主)

2021-02-07 23:34

Here is my crack at a solution extended from a problem I was already working on.

def group_agg_pivot_df(df, group_cols, agg_func='count', agg_col=None):

    if agg_col is None:
        agg_col = group_cols[0]

    grouped = df.groupby(group_cols).agg({agg_col: agg_func}) \
        .unstack().fillna(0)
    # drop aggregation column name from hierarchical column names
    grouped.columns = grouped.columns.droplevel()

    # promote index to column (the first element of group_cols)
    pivot_df = grouped.reset_index()
    pivot_df.columns = [s.replace(' ', '_').lower() for s in pivot_df.columns]
    return pivot_df

def split_stack_df(df, id_cols, split_col, new_col_name):
    # id_cols are the columns we want to pair with the values
    # from the split column

    stacked = df.set_index(id_cols)[split_col].str.split(',', expand=True) \
        .stack().reset_index(level=id_cols)
    stacked.columns = id_cols + [new_col_name]
    return stacked

stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc')
final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])

I also benchmarked @MaxU's, @piRSquared's, and my solutions on a pandas data frame with 11592 rows, and a column containing lists with 2681 unique values. Obviously the column names are different in the testing data frame but I have kept them the same as in the question.

Here are the benchmarks for each method

In [277]: %timeit pd.get_dummies(df.set_index(['PRODUCTS', 'DATE']) \
 ...:                        .DESCRIPTION.str.split(',', expand=True) \
 ...:                        .stack()) \
 ...:     .groupby(['PRODUCTS', 'DATE']).sum()
 ...:

1 loop, best of 3: 1.14 s per loop

In [278]: %timeit df.set_index(['PRODUCTS', 'DATE']) \
 ...:     .DESCRIPTION.str.split(',', expand=True) \
 ...:     .stack() \
 ...:     .reset_index() \
 ...:     .pivot_table(index=['PRODUCTS', 'DATE'], columns=0, fill_value=0, aggfunc='size')

1 loop, best of 3: 612 ms per loop

In [286]: %timeit stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc'); \
 ...:     final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])

1 loop, best of 3: 62.7 ms per loop

My guess is that aggregation and unstacking is faster than either pivot_table() or pd.get_dummies().

0 讨论(0)

查看其它5个回答