How to explode a list inside a Dataframe cell into separate rows

天命终不由人 2020-11-22 10:20

I'm looking to turn a pandas cell containing a list into one row for each of its values.

So, take this:

If I'd like to unpack and stack the value

11 Answers
  •  孤独总比滥情好
    2020-11-22 11:10

    The fastest method I have found so far is to replicate the DataFrame's rows with .iloc and then assign the flattened target column back.
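
    To make the idea concrete, here is a minimal sketch on a toy frame. The frame and the names toy, key, vals, and positions are made up for illustration and do not come from the question; the sketch only shows how repeating row positions lines up with the flattened list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two rows holding lists of different lengths.
toy = pd.DataFrame({"key": ["a", "b"], "vals": [[1, 2], [3, 4, 5]]})

# Flatten the list column into one long Python list.
flat = [item for sublist in toy["vals"] for item in sublist]

# Repeat each row position once per element of that row's list.
lens = toy["vals"].apply(len).to_numpy()
positions = np.repeat(np.arange(len(toy)), lens)

# Select the repeated rows (every column except the list column),
# then assign the flattened values back positionally.
out = toy.iloc[positions, [0]].copy()
out["vals"] = flat
print(out)
```

    Row 0 is repeated twice and row 1 three times, so the five flattened values line up one per row.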

    Given the usual input (replicated a bit):

    import numpy as np
    import pandas as pd

    df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                        'opponent': ['76ers', 'blazers', 'bobcats'], 
                        'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
          .set_index(['name', 'opponent']))
    df = pd.concat([df]*10)
    
    df
    Out[3]: 
                                                       nearest_neighbors
    name       opponent                                                 
    A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
    ...
    

    Compare the following candidate implementations:

    col_target = 'nearest_neighbors'
    
    def extend_iloc():
        # Flatten columns of lists
        col_flat = [item for sublist in df[col_target] for item in sublist] 
        # Row numbers to repeat 
        lens = df[col_target].apply(len)
        vals = range(df.shape[0])
        ilocations = np.repeat(vals, lens)
        # Replicate rows and add flattened column of lists
        cols = [i for i,c in enumerate(df.columns) if c != col_target]
        new_df = df.iloc[ilocations, cols].copy()
        new_df[col_target] = col_flat
        return new_df
    
    def melt():
        return (pd.melt(df[col_target].apply(pd.Series).reset_index(), 
                 id_vars=['name', 'opponent'],
                 value_name=col_target)
                .set_index(['name', 'opponent'])
                .drop('variable', axis=1)
                .dropna()
                .sort_index())
    
    def stack_unstack():
        return (df[col_target].apply(pd.Series)
                .stack()
                .reset_index(level=2, drop=True)
                .to_frame(col_target))
    

    On this 30-row input, extend_iloc() is the fastest of the three:

    %timeit extend_iloc()
    3.11 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit melt()
    22.5 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit stack_unstack()
    11.5 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
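
    As a point of comparison, pandas 0.25 and later ship a built-in DataFrame.explode that performs this transformation directly. The sketch below applies it to the unreplicated three-row input. It was not part of the timings above, so no claim is made here about how it ranks against extend_iloc().

```python
import pandas as pd

# Same three-row input as above, without the 10x replication.
df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
                    'opponent': ['76ers', 'blazers', 'bobcats'],
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin',
                                           'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))

# DataFrame.explode emits one row per list element and repeats
# the index values of the originating row for each element.
exploded = df.explode('nearest_neighbors')
print(exploded)
```

    The result has one row per neighbor, twelve in total, with the (name, opponent) index repeated for each.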
    
