Splitting multiple columns into rows in pandas dataframe

独厮守ぢ 2020-12-03 09:16

I have a pandas dataframe as follows:

ticker    account      value         date
aa       assets       100,200       20121231, 20131231
bb       liabilities           


        
5 answers
  •  不思量自难忘°
    2020-12-03 09:41

    I see this question a lot: how do I split a column that holds a list into multiple rows? I've seen it called exploding. Here are some links:

    • https://stackoverflow.com/a/38432346/2336654
    • https://stackoverflow.com/a/38499036/2336654

    So I wrote a function that will do it.

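    import numpy as np
    import pandas as pd

    def explode(df, columns):
        # Repeat each index label once per item in that row's list
        idx = np.repeat(df.index, df[columns[0]].str.len())
        # reindex replaces reindex_axis, which newer pandas no longer provides
        a = df.T.reindex(columns).values
        concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
        p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
        # join (not concat) so the repeated labels in p align with df's unique index
        return df.drop(columns, axis=1).join(p).reset_index(drop=True)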
    

    But before we can use it, we need lists (or other iterables) in the columns.

    Setup

    df = pd.DataFrame([['aa', 'assets',      '100,200', '20121231,20131231'],
                       ['bb', 'liabilities', '50,50',   '20141231,20131231']],
                      columns=['ticker', 'account', 'value', 'date'])
    
    df
    

    Split the value and date columns:

    df['value'] = df['value'].str.split(',')
    df['date'] = df['date'].str.split(',')
    
    df
    

    Now we could explode on either column or both, one after the other.

    Solution

    explode(df, ['value','date'])
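
    For comparison, newer pandas releases ship a built-in DataFrame.explode. The sketch below assumes pandas 1.3 or later (where explode accepts a list of columns) and that the lists in each row have matching lengths. It rebuilds the setup frame so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    [['aa', 'assets', '100,200', '20121231,20131231'],
     ['bb', 'liabilities', '50,50', '20141231,20131231']],
    columns=['ticker', 'account', 'value', 'date'])

# Turn the comma-separated strings into lists first.
df['value'] = df['value'].str.split(',')
df['date'] = df['date'].str.split(',')

# Explode both columns together; assumes pandas >= 1.3 and
# equal list lengths within each row.
result = df.explode(['value', 'date']).reset_index(drop=True)
print(result)
```

    Exploding both columns in a single call keeps each value paired with its date, which is the same pairing the explode function above produces.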
    


    Timing

    I removed strip from @jezrael's timing because I could not effectively add it to mine. Stripping is a necessary step for this question, since the OP's strings have spaces after the commas. My aim was to provide a generic way to explode columns that already hold iterables, and I think I've accomplished that.
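
    For data like the OP's, one way to fold the strip into the split step is sketched below. split_and_strip is a hypothetical helper, not part of pandas, and it assumes every cell is a string (a missing value would need separate handling).

```python
import pandas as pd

# Sample shaped like the question's first row, with a space after the comma.
df = pd.DataFrame({'ticker': ['aa'],
                   'value': ['100,200'],
                   'date': ['20121231, 20131231']})


def split_and_strip(series, sep=','):
    """Hypothetical helper: split each string on `sep`, then strip each piece."""
    return series.str.split(sep).apply(lambda parts: [p.strip() for p in parts])


df['value'] = split_and_strip(df['value'])
df['date'] = split_and_strip(df['date'])
print(df)
```

    After this step the columns hold clean lists, so they can be passed to an explode step as-is.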

    code

    def get_df(n=1):
        return pd.DataFrame([['aa', 'assets',      '100,200,200', '20121231,20131231,20131231'],
                             ['bb', 'liabilities', '50,50',   '20141231,20131231']] * n,
                            columns=['ticker', 'account', 'value', 'date'])
    

    small 2 row sample

    medium 200 row sample

    large 2,000,000 row sample
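
    A minimal sketch of how timings for these sample sizes could be collected with the standard-library timeit module. It assumes pandas 1.3 or later for the built-in multi-column DataFrame.explode, uses a copy of the explode function adapted for current pandas (reindex and join), and stops at n=100 (200 rows) to stay quick. prepare, explode_custom and explode_builtin are illustrative names, and no timing numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd


def get_df(n=1):
    return pd.DataFrame([['aa', 'assets', '100,200,200', '20121231,20131231,20131231'],
                         ['bb', 'liabilities', '50,50', '20141231,20131231']] * n,
                        columns=['ticker', 'account', 'value', 'date'])


def prepare(n):
    # Hypothetical helper: build the sample and split both columns into lists.
    df = get_df(n)
    df['value'] = df['value'].str.split(',')
    df['date'] = df['date'].str.split(',')
    return df


def explode_custom(df, columns):
    # The answer's approach, adapted (reindex and join) for current pandas.
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex(columns).values
    flat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(flat.reshape(a.shape[0], -1).T, idx, columns)
    return df.drop(columns, axis=1).join(p).reset_index(drop=True)


def explode_builtin(df, columns):
    # Built-in alternative; multi-column explode assumes pandas >= 1.3.
    return df.explode(columns).reset_index(drop=True)


cols = ['value', 'date']
timings = {}
for n in (1, 100):  # 2 rows and 200 rows
    df = prepare(n)
    for name, func in (('custom', explode_custom), ('builtin', explode_builtin)):
        timings[(name, n)] = timeit.timeit(lambda: func(df, cols), number=5)
        print(f'{name:8s} n={n:<4d} {timings[(name, n)]:.4f}s for 5 runs')
```

    Raising n toward 1,000,000 would reproduce the large sample, at the cost of a much longer run.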
