I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma separated string of attr
Here is my crack at a solution extended from a problem I was already working on.
def group_agg_pivot_df(df, group_cols, agg_func='count', agg_col=None):
    # default to aggregating on the first grouping column
    if agg_col is None:
        agg_col = group_cols[0]

    # aggregate, then move the last grouping level into the columns
    grouped = df.groupby(group_cols).agg({agg_col: agg_func}) \
        .unstack().fillna(0)
    # drop aggregation column name from hierarchical column names
    grouped.columns = grouped.columns.droplevel()
    # promote the remaining index levels (all but the last element
    # of group_cols) to columns
    pivot_df = grouped.reset_index()
    pivot_df.columns = [s.replace(' ', '_').lower() for s in pivot_df.columns]
    return pivot_df

def split_stack_df(df, id_cols, split_col, new_col_name):
    # id_cols are the columns we want to pair with the values
    # from the split column
    stacked = df.set_index(id_cols)[split_col].str.split(',', expand=True) \
        .stack().reset_index(level=id_cols)
    stacked.columns = id_cols + [new_col_name]
    return stacked

stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc')
final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])
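
To make the two steps concrete, here is a minimal sketch on a toy frame. The data is made up for illustration, and I use size() in place of agg with 'count' for brevity, on the assumption that the two are interchangeable when all you want is a row count per group.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    'PRODUCTS': ['A', 'B'],
    'DATE': ['2016-01-01', '2016-01-02'],
    'DESCRIPTION': ['red,small', 'red,large'],
})

# Step 1: split the comma separated string and stack the pieces,
# giving one row per (PRODUCTS, DATE, attribute).
stacked = (
    df.set_index(['PRODUCTS', 'DATE'])['DESCRIPTION']
      .str.split(',', expand=True)
      .stack()
      .reset_index(level=['PRODUCTS', 'DATE'])
)
stacked.columns = ['PRODUCTS', 'DATE', 'desc']

# Step 2: count each attribute per product/date, then unstack the
# attribute level so every attribute becomes its own column.
counts = (
    stacked.groupby(['PRODUCTS', 'DATE', 'desc'])
           .size()
           .unstack(fill_value=0)
           .reset_index()
)
print(counts)
```

Each distinct attribute ends up as a column, with 1 where the product has it and 0 where it does not.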
I also benchmarked @MaxU's, @piRSquared's, and my solutions on a pandas data frame with 11592 rows and a column containing lists with 2681 unique values. The test data frame uses different column names, but I have kept the names from the question in the code shown here.
Here are the benchmarks for each method:
In [277]: %timeit pd.get_dummies(df.set_index(['PRODUCTS', 'DATE']) \
...: .DESCRIPTION.str.split(',', expand=True) \
...: .stack()) \
...: .groupby(['PRODUCTS', 'DATE']).sum()
...:
1 loop, best of 3: 1.14 s per loop
In [278]: %timeit df.set_index(['PRODUCTS', 'DATE']) \
...: .DESCRIPTION.str.split(',', expand=True) \
...: .stack() \
...: .reset_index() \
...: .pivot_table(index=['PRODUCTS', 'DATE'], columns=0, fill_value=0, aggfunc='size')
1 loop, best of 3: 612 ms per loop
In [286]: %timeit stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc'); \
...: final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])
1 loop, best of 3: 62.7 ms per loop
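On this data the groupby version is roughly 10 times faster than the pivot_table() version and roughly 18 times faster than the pd.get_dummies() version. My guess is that aggregating with groupby and then unstacking is simply cheaper than either pivot_table() or pd.get_dummies().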
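
If you want to check this on your own data, here is a rough sketch of a timing harness using the standard timeit module. The synthetic frame, its size, and the helper names (split_stack, via_get_dummies, via_groupby_unstack) are all assumptions made for illustration; it is not the data set I benchmarked, and the numbers will differ with your pandas version and data shape.

```python
import random
import timeit

import pandas as pd

# Synthetic stand-in data (an assumption, not the benchmarked data set):
# 500 products, each with 3 distinct attributes drawn from a pool of 50.
random.seed(0)
attrs = ['attr%d' % i for i in range(50)]
df = pd.DataFrame({
    'PRODUCTS': ['p%d' % i for i in range(500)],
    'DATE': ['2016-01-01'] * 500,
    'DESCRIPTION': [','.join(random.sample(attrs, 3)) for _ in range(500)],
})


def split_stack(frame):
    # One value per (PRODUCTS, DATE, position) index entry.
    return (frame.set_index(['PRODUCTS', 'DATE'])['DESCRIPTION']
                 .str.split(',', expand=True)
                 .stack())


def via_get_dummies(frame):
    # Indicator columns per attribute, summed per product/date.
    return pd.get_dummies(split_stack(frame)).groupby(['PRODUCTS', 'DATE']).sum()


def via_groupby_unstack(frame):
    # Count rows per (product, date, attribute), then unstack the attribute.
    stacked = (split_stack(frame).rename('desc')
                                 .reset_index(level=['PRODUCTS', 'DATE']))
    return (stacked.groupby(['PRODUCTS', 'DATE', 'desc'])
                   .size()
                   .unstack(fill_value=0))


for func in (via_get_dummies, via_groupby_unstack):
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print('%s: %.4f s per call' % (func.__name__, seconds))
```

Both helpers should produce the same table of counts, so any difference in timing comes from the method rather than the result.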