pandas: How do I split text in a column into multiple rows?

说谎 2020-11-22 09:47

I'm working with a large CSV file, and the next-to-last column has a string of text that I want to split by a specific delimiter. I was wondering if there is a simple way to

7 answers
  •  猫巷女王i
    2020-11-22 10:11

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'ItemQty': {0: 3, 1: 25}, 
                       'Seatblocks': {0: '2:218:10:4,6', 1: '1:13:36:1,12 1:13:37:1,13'}, 
                       'ItemExt': {0: 60, 1: 300}, 
                       'CustomerName': {0: 'McCartney, Paul', 1: 'Lennon, John'}, 
                       'CustNum': {0: 32363, 1: 31316}, 
                       'Item': {0: 'F04', 1: 'F01'}}, 
                        columns=['CustNum','CustomerName','ItemQty','Item','Seatblocks','ItemExt'])
    
    print (df)
       CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
    0    32363  McCartney, Paul        3  F04               2:218:10:4,6       60
    1    31316     Lennon, John       25  F01  1:13:36:1,12 1:13:37:1,13      300
    

    One solution chains str.split, stack, reset_index and rename, then joins the result back to the remaining columns:

    print (df.drop('Seatblocks', axis=1)
                 .join
                 (
                 df.Seatblocks
                 .str
                 .split(expand=True)
                 .stack()
                 .reset_index(drop=True, level=1)
                 .rename('Seatblocks')           
                 ))
    
       CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
    0    32363  McCartney, Paul        3  F04       60  2:218:10:4,6
    1    31316     Lennon, John       25  F01      300  1:13:36:1,12
    1    31316     Lennon, John       25  F01      300  1:13:37:1,13
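
    On newer pandas versions (0.25 and later), the same reshaping can also be written with DataFrame.explode. This is not part of the answer above, only a minimal sketch of an alternative, and the sample frame is trimmed to three of the original columns for brevity:

```python
import pandas as pd

# Same two-row sample as above, trimmed to the columns that matter here.
df = pd.DataFrame({
    'CustNum': [32363, 31316],
    'CustomerName': ['McCartney, Paul', 'Lennon, John'],
    'Seatblocks': ['2:218:10:4,6', '1:13:36:1,12 1:13:37:1,13'],
})

# Split on whitespace into lists, then give each list element its own row.
# explode repeats the original index label for every row it creates.
result = df.assign(Seatblocks=df['Seatblocks'].str.split()).explode('Seatblocks')

print(result)
```

    Unlike the join-based version, this keeps Seatblocks in its original column position. The repeated index labels (0, 1, 1) can be replaced with result.reset_index(drop=True) if a fresh index is needed.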
    

    If the column contains no NaN values, the fastest solution is a list comprehension with the DataFrame constructor:

    df = pd.DataFrame(['a b c']*100000, columns=['col'])
    
    In [141]: %timeit (pd.DataFrame(dict(zip(range(3), [df['col'].apply(lambda x : x.split(' ')[i]) for i in range(3)]))))
    1 loop, best of 3: 211 ms per loop
    
    In [142]: %timeit (pd.DataFrame(df.col.str.split().tolist()))
    10 loops, best of 3: 87.8 ms per loop
    
    In [143]: %timeit (pd.DataFrame(list(df.col.str.split())))
    10 loops, best of 3: 86.1 ms per loop
    
    In [144]: %timeit (df.col.str.split(expand=True))
    10 loops, best of 3: 156 ms per loop
    
    In [145]: %timeit (pd.DataFrame([ x.split() for x in df['col'].tolist()]))
    10 loops, best of 3: 54.1 ms per loop
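
    The timings above come from IPython's %timeit. To repeat the comparison in a plain script, something like the following sketch with the standard timeit module should work. The function names are made up for illustration, and the absolute numbers will differ by machine and pandas version:

```python
import timeit

import pandas as pd

df = pd.DataFrame(['a b c'] * 100000, columns=['col'])


def split_with_comprehension():
    # Plain Python split on each value, then one DataFrame constructor call.
    return pd.DataFrame([x.split() for x in df['col'].tolist()])


def split_with_str_accessor():
    # pandas string method; expand=True returns a DataFrame directly.
    return df['col'].str.split(expand=True)


for func in (split_with_comprehension, split_with_str_accessor):
    # Best of 3 single runs, reported in milliseconds.
    best = min(timeit.repeat(func, number=1, repeat=3))
    print(f'{func.__name__}: {best * 1000:.1f} ms')
```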
    

    But if the column contains NaN, only str.split with the parameter expand=True works. It returns a DataFrame (documentation), which explains why it is slower:

    df = pd.DataFrame(['a b c']*10, columns=['col'])
    df.loc[0] = np.nan
    print (df.head())
         col
    0    NaN
    1  a b c
    2  a b c
    3  a b c
    4  a b c
    
    print (df.col.str.split(expand=True))
         0     1     2
    0  NaN  None  None
    1    a     b     c
    2    a     b     c
    3    a     b     c
    4    a     b     c
    5    a     b     c
    6    a     b     c
    7    a     b     c
    8    a     b     c
    9    a     b     c
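
    To see why the list comprehension breaks here: NaN is a float, and a float has no split method. The sketch below reproduces the failure on a smaller frame and then shows one possible workaround, an isinstance guard. The guard is an assumption added for illustration, not something the answer above proposes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['a b c'] * 3, columns=['col'])
df.loc[0] = np.nan

# The unguarded list comprehension calls .split on the missing value and fails.
try:
    pd.DataFrame([x.split() for x in df['col'].tolist()])
    failed = False
except AttributeError as exc:
    failed = True
    print('list comprehension failed:', exc)

# Guarded variant: anything that is not a string becomes an empty row,
# which the DataFrame constructor pads with missing values.
guarded = pd.DataFrame(
    [x.split() if isinstance(x, str) else [] for x in df['col'].tolist()]
)
print(guarded)
```

    Whether the guard keeps the speed advantage over str.split(expand=True) is not measured here and would need its own timing.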
    
