Combine two columns of text in pandas dataframe

-上瘾入骨i 2020-11-22 01:32

I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called p

18 answers
  •  渐次进展
    2020-11-22 01:44

    Small data sets (< 150 rows)

    [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
    

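    or, slightly slower but more compact (cast Year to str first if it is numeric, because the .str accessor only works on string columns):

    df.Year.astype(str).str.cat(df.quarter)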
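
As a minimal sketch of the two small-data approaches above, assuming Year holds integers and quarter holds strings (the sample frame and the `combined` column name are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical sample frame: Year as integers, quarter as strings.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# List comprehension: convert Year to str, then join each (year, quarter) pair.
joined_list = ["".join(pair) for pair in zip(df["Year"].map(str), df["quarter"])]

# str.cat: the .str accessor needs a string column, so cast Year first.
joined_cat = df["Year"].astype(str).str.cat(df["quarter"])

# Either result can be assigned as a new column ("combined" is an assumed name).
df["combined"] = joined_list
print(df)
```

Both variants should produce the same values; the list comprehension returns a plain Python list, while `str.cat` returns a Series aligned to the frame's index.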
    

    Larger data sets (> 150 rows)

    df['Year'].astype(str) + df['quarter']
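
A short sketch of the vectorized form in context, again assuming an integer Year column. The column names `combined` and `combined_sep` and the `-` separator are illustrative assumptions only:

```python
import pandas as pd

# Hypothetical sample frame: Year as integers, quarter as strings.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# Year is numeric here, so cast it to str before "+" can concatenate.
df["combined"] = df["Year"].astype(str) + df["quarter"]

# A separator can be added as a literal string between the two operands.
df["combined_sep"] = df["Year"].astype(str) + "-" + df["quarter"]

print(df)
```

If Year is already stored as strings, the `astype(str)` call is harmless and can be left in place.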
    

    UPDATE: Timing graph, Pandas 0.23.4 (graph not reproduced here)

    Let's test it on a DataFrame with 200K rows:

    In [250]: df
    Out[250]:
       Year quarter
    0  2014      q1
    1  2015      q2
    
    In [251]: df = pd.concat([df] * 10**5)
    
    In [252]: df.shape
    Out[252]: (200000, 2)
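
To rebuild a test frame like this outside IPython, something along these lines should work. The starting two-row frame is reconstructed from the output shown above, and `ignore_index=True` is an addition that the original session does not use (without it the index repeats 0, 1, 0, 1, ...):

```python
import pandas as pd

# Two-row base frame matching the displayed output (Year dtype is assumed to be int).
base = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# Repeat the frame 10**5 times; ignore_index=True gives a fresh 0..n-1 index.
big = pd.concat([base] * 10**5, ignore_index=True)

print(big.shape)
```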
    

    UPDATE: new timings using Pandas 0.19.0

    Timing without CPU/GPU optimization (sorted from fastest to slowest):

    In [107]: %timeit df['Year'].astype(str) + df['quarter']
    10 loops, best of 3: 131 ms per loop
    
    In [106]: %timeit df['Year'].map(str) + df['quarter']
    10 loops, best of 3: 161 ms per loop
    
    In [108]: %timeit df.Year.str.cat(df.quarter)
    10 loops, best of 3: 189 ms per loop
    
    In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 567 ms per loop
    
    In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 584 ms per loop
    
    In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
    1 loop, best of 3: 24.7 s per loop
    

    Timing using CPU/GPU optimization:

    In [113]: %timeit df['Year'].astype(str) + df['quarter']
    10 loops, best of 3: 53.3 ms per loop
    
    In [114]: %timeit df['Year'].map(str) + df['quarter']
    10 loops, best of 3: 65.5 ms per loop
    
    In [115]: %timeit df.Year.str.cat(df.quarter)
    10 loops, best of 3: 79.9 ms per loop
    
    In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 230 ms per loop
    
    In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 230 ms per loop
    
    In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
    1 loop, best of 3: 9.38 s per loop
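
The comparison can be repeated on a current pandas version with the standard `timeit` module. This is only a rough harness under stated assumptions: the frame is smaller than the 200K rows used above to keep the slow `apply` variant quick, Year is assumed to be an integer column, and the labels in `candidates` are made up for this sketch. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Smaller test frame than in the timings above (20,000 rows), to keep runs short.
base = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})
df = pd.concat([base] * 10**4, ignore_index=True)

# Candidate expressions; the dictionary keys are just labels for the report.
candidates = {
    "astype(str) +": lambda: df["Year"].astype(str) + df["quarter"],
    "map(str) +": lambda: df["Year"].map(str) + df["quarter"],
    "str.cat": lambda: df["Year"].astype(str).str.cat(df["quarter"]),
    "apply(axis=1)": lambda: df.apply(
        lambda row: "{}{}".format(row["Year"], row["quarter"]), axis=1
    ),
}

# Best of 3 single runs per candidate, in seconds.
results = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}

# Report from fastest to slowest.
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:<15} {seconds * 1000:8.2f} ms")
```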
    

    Answer contribution by @anton-vbr
