Expand pandas DataFrame column into multiple rows

星月不相逢 2020-12-05 04:50

If I have a DataFrame such that:

pd.DataFrame( {"name" : "John", 
               "days" : [[1, 3, 5, 7]]
              })
7 Answers
  •  粉色の甜心
    2020-12-05 05:37

    You could use df.itertuples to iterate through each row, and use a list comprehension to reshape the data into the desired form:

    import pandas as pd
    
    df = pd.DataFrame( {"name" : ["John", "Eric"], 
                   "days" : [[1, 3, 5, 7], [2,4]]})
    result = pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])
    print(result)
    

    yields

       0     1
    0  1  John
    1  3  John
    2  5  John
    3  7  John
    4  2  Eric
    5  4  Eric
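
The result above has the default column labels 0 and 1. As a minimal sketch of the same itertuples approach with labelled columns (the names `rows`, `named`, and the column label "day" are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# One (name, day) pair per list element; index=False leaves the row index
# out of each tuple, so only the "name" and "days" fields are present.
rows = [(tup.name, d) for tup in df.itertuples(index=False) for d in tup.days]

# Label the columns explicitly instead of relying on the defaults 0 and 1.
named = pd.DataFrame(rows, columns=["name", "day"])
print(named)
```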
    

    Of the four approaches timed below, Divakar's solution, using_repeat, is the fastest:

    In [48]: %timeit using_repeat(df)
    1000 loops, best of 3: 834 µs per loop
    
    In [5]: %timeit using_itertuples(df)
    100 loops, best of 3: 3.43 ms per loop
    
    In [7]: %timeit using_apply(df)
    1 loop, best of 3: 379 ms per loop
    
    In [8]: %timeit using_append(df)
    1 loop, best of 3: 3.59 s per loop
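
To see why the repeat-based approach avoids Python-level row construction, here is a small worked sketch on the two-row frame from earlier. It repeats each name once per list element and flattens the lists in one call; the variable names `names`, `flat_days`, and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# Number of elements in each list: [4, 2]
lens = [len(item) for item in df["days"]]

# Repeat each name as many times as its list is long.
names = np.repeat(df["name"].values, lens)

# Flatten all lists into one 1-D array.
# Note: np.concatenate raises ValueError if the frame has no rows at all.
flat_days = np.concatenate(df["days"].values)

result = pd.DataFrame({"name": names, "days": flat_days})
print(result)
```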
    

    Here is the setup used for the above benchmark:

    import numpy as np
    import pandas as pd
    
    N = 10**3
    df = pd.DataFrame( {"name" : np.random.choice(list('ABCD'), size=N), 
                        "days" : [np.random.randint(10, size=np.random.randint(5))
                                  for i in range(N)]})
    
    def using_itertuples(df):
        return pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])
    
    def using_repeat(df):
        lens = [len(item) for item in df['days']]
        return pd.DataFrame( {"name" : np.repeat(df['name'].values,lens), 
                              "days" : np.concatenate(df['days'].values)})
    
    def using_apply(df):
        return (df.apply(lambda x: pd.Series(x.days), axis=1)
                .stack()
                .reset_index(level=1, drop=True)
                .to_frame('day')
                .join(df['name']))
    
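    def using_append(df):
        # Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.
        # On newer versions a rough equivalent of the append call below is
        # df2 = pd.concat([df2, new_r.to_frame().T]).
        df2 = pd.DataFrame(columns = df.columns)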
        for i,r in df.iterrows():
            for e in r.days:
                new_r = r.copy()
                new_r.days = e
                df2 = df2.append(new_r)
        return df2
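
For readers on newer pandas, a built-in alternative may also be worth trying. The sketch below assumes pandas 0.25 or later, where `DataFrame.explode` is available; it was not part of the benchmark above, so no timing claim is made for it.

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# explode() emits one row per list element and repeats the original index,
# so reset_index(drop=True) is used to get a fresh 0..n-1 index.
exploded = df.explode("days").reset_index(drop=True)
print(exploded)
```

Two behaviours to keep in mind: the exploded column comes back with object dtype, and a row whose list is empty is kept with NaN in that column rather than being dropped, which differs from the repeat and itertuples versions.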
    
