Pandas expand rows from list data available in column

陌清茗 2020-11-30 07:18

I have a data frame like this in pandas:

 column1      column2
 [a,b,c]        1
 [d,e,f]        2
 [g,h,i]        3

Expected outp

3 answers
  • 2020-11-30 07:48

    You can build a new DataFrame with the constructor and then stack it. The chain is wrapped in parentheses so that it can span several lines:

    df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
             .stack()
             .reset_index(level=1, drop=True)
             .reset_index(name='column1')[['column1','column2']])
    print (df2)
    
      column1  column2
    0       a        1
    1       b        1
    2       c        1
    3       d        2
    4       e        2
    5       f        2
    6       g        3
    7       h        3
    8       i        3
    

    Because the final subset [['column1','column2']] both reorders the columns and discards the leftover index level, you can also omit the first reset_index:

    df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
             .stack()
             .reset_index(name='column1')[['column1','column2']])
    print (df2)
      column1  column2
    0       a        1
    1       b        1
    2       c        1
    3       d        2
    4       e        2
    5       f        2
    6       g        3
    7       h        3
    8       i        3
    

    Another solution is to use DataFrame.from_records to create a DataFrame from the first column, turn it into a Series with stack, and join that Series back to the original DataFrame:

    df = pd.DataFrame({'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
                       'column2':[1,2,3]})
    
    
    a = (pd.DataFrame.from_records(df.column1.tolist())
           .stack()
           .reset_index(level=1, drop=True)
           .rename('column1'))
    
    print (a)
    0    a
    0    b
    0    c
    1    d
    1    e
    1    f
    2    g
    2    h
    2    i
    Name: column1, dtype: object
    
    print (df.drop('column1', axis=1)
             .join(a)
             .reset_index(drop=True)[['column1','column2']])
    
      column1  column2
    0       a        1
    1       b        1
    2       c        1
    3       d        2
    4       e        2
    5       f        2
    6       g        3
    7       h        3
    8       i        3
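
The constructor-and-stack steps above could be wrapped in a small helper. This is only a sketch: the name expand_list_column is hypothetical (it comes neither from pandas nor from this answer), and it assumes every cell of the list column holds a list and that all lists have the same length, as in the question.

```python
import pandas as pd


def expand_list_column(df, list_col):
    """Hypothetical helper: return a frame with one row per list element."""
    # One column per list position, stacked into a Series that keeps
    # the original row label for every element.
    stacked = (
        pd.DataFrame(df[list_col].tolist(), index=df.index)
        .stack()
        .reset_index(level=1, drop=True)
        .rename(list_col)
    )
    # Join on the original index so the other columns are repeated,
    # then restore the original column order.
    return (
        df.drop(columns=list_col)
        .join(stacked)
        .reset_index(drop=True)[list(df.columns)]
    )


df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})
out = expand_list_column(df, 'column1')
print(out)
```

Joining on the index rather than using column2 as the index means the helper does not depend on any other column being unique.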
    
  • 2020-11-30 08:03

    Another solution is to use the result_type='expand' argument of DataFrame.apply, available since pandas 0.23. In answer to @splinter's question, this method can also be generalized to frames with more columns; see below:

    import pandas as pd
    from numpy import arange
    
    df = pd.DataFrame(
        {'column1' : [['a','b','c'],['d','e','f'],['g','h','i']],
        'column2': [1,2,3]}
    )
    
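    # Expand each list into columns 0..n-1, then melt those columns back into rows.
    # column1 is dropped first: melt rejects a value_name that is already a column.
    expanded = df.apply(lambda row: row['column1'], axis=1, result_type='expand')
    
    pd.melt(
        df.drop(columns='column1').join(expanded),
        id_vars=['column2'],
        value_vars=arange(expanded.shape[1]),
        value_name='column1')[['column1','column2']]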
    
    # can be generalized 
    
    df = pd.DataFrame(
        {'column1' : [['a','b','c'],['d','e','f'],['g','h','i']],
        'column2': [1,2,3],
        'column3': [[1,2],[2,3],[3,4]],
        'column4': [42,23,321],
        'column5': ['a','b','c']}
    )
    
    # Every column except column1 is kept as an id column and repeated per element.
    expanded = df.apply(lambda row: row['column1'], axis=1, result_type='expand')
    
    (pd.melt(
        df.drop(columns='column1').join(expanded),
        id_vars=list(df.columns[1:]),
        value_vars=arange(expanded.shape[1]),
        value_name='column1')
     .drop(columns=['variable'])[list(df.columns)]
     .sort_values(by=['column1']))
    

    UPDATE (for Jwely's comment): if your lists vary in length, you can pad the shorter ones and drop the padding afterwards:

    df = pd.DataFrame(
        {'column1' : [['a','b','c'],['d','f'],['g','h','i']],
        'column2': [1,2,3]}
    )
    
    longest = df['column1'].apply(len).max()
    
    # Pad shorter lists with None so every row expands to the same width.
    expanded = df.apply(
        lambda row: row['column1'] + [None] * (longest - len(row['column1'])),
        axis=1, result_type='expand')
    
    # Melt as before, then drop the rows that only hold padding.
    (pd.melt(
        df.drop(columns='column1').join(expanded),
        id_vars=['column2'],
        value_vars=arange(longest),
        value_name='column1')
     .dropna(subset=['column1'])[['column1','column2']])
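
For lists of varying length, a possible shortcut is to let the DataFrame constructor do the padding, since it fills short rows with missing values by itself. The sketch below swaps that constructor in for apply(..., result_type='expand'); the variable names wide and long_df are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

# One column per list position; the two-element row is padded automatically.
wide = pd.DataFrame(df['column1'].tolist(), index=df.index)

long_df = (
    df.drop(columns='column1')
    .join(wide)
    .melt(id_vars=['column2'], value_name='column1')
    .dropna(subset=['column1'])             # remove the padding rows
    .sort_values('column2', kind='stable')  # group elements of one list together
    .reset_index(drop=True)[['column1', 'column2']]
)
print(long_df)
```

The stable sort keeps the elements of each list in their original order, because melt emits them position by position.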
    
  • 2020-11-30 08:04

    DataFrame.explode

    Since pandas 0.25.0 there is the explode method for this. It turns each element of a list into its own row and repeats the values of the other columns:

    df.explode('column1').reset_index(drop=True)
    

    Output

    
      column1  column2
    0       a        1
    1       b        1
    2       c        1
    3       d        2
    4       e        2
    5       f        2
    6       g        3
    7       h        3
    8       i        3
    

    Since pandas 1.1.0 explode also accepts the ignore_index argument, so there is no need to chain reset_index:

    df.explode('column1', ignore_index=True)
    

    Output

      column1  column2
    0       a        1
    1       b        1
    2       c        1
    3       d        2
    4       e        2
    5       f        2
    6       g        3
    7       h        3
    8       i        3
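
A short sketch of how explode treats lists of different lengths, assuming pandas 1.1.0 or later for ignore_index. The sample frame is made up for illustration: it has a three-element list, a one-element list and an empty list.

```python
import pandas as pd

df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d'], []],
                   'column2': [1, 2, 3]})

# No padding is needed; each list contributes as many rows as it has elements.
out = df.explode('column1', ignore_index=True)
print(out)
```

The empty list is not dropped: it yields a single row with a missing value in column1, so chain dropna(subset=['column1']) if such rows are unwanted.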
    