What is the fastest way to output a large DataFrame into a CSV file?

北海茫月 2020-12-01 14:19

With Python / pandas, I find that df.to_csv(fname) writes about 1 million rows per minute. I can sometimes improve performance by a factor of 7 like this:

def         


        
4 Answers
  •  Happy的楠姐
    2020-12-01 14:29

    Lev, pandas has rewritten to_csv to make a big improvement in native speed. The process is now I/O bound, and it accounts for many subtle dtype issues and quoting cases. Here are our performance results for the upcoming 0.11 release vs. 0.10.1. Times are in ms; ratio is t_head / t_baseline, so a lower ratio is better.

    Results:
                                                t_head  t_baseline      ratio
    name                                                                     
    frame_to_csv2 (100k rows)                 190.5260   2244.4260     0.0849
    write_csv_standard  (10k rows)             38.1940    234.2570     0.1630
    frame_to_csv_mixed  (10k rows, mixed)     369.0670   1123.0412     0.3286
    frame_to_csv (3k rows, wide)              112.2720    226.7549     0.4951
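
To make the ratio column easier to read, here is a small sketch that recomputes it from the timings in the table and also expresses each row as a speedup factor (t_baseline / t_head). The dictionary keys and the helper name `ratios_and_speedups` are made up for this illustration; only the numbers come from the table.

```python
# Timings in ms copied from the table above: (t_head, t_baseline).
# The short keys are illustrative labels, not official benchmark names.
timings = {
    "frame_to_csv2": (190.5260, 2244.4260),
    "write_csv_standard": (38.1940, 234.2570),
    "frame_to_csv_mixed": (369.0670, 1123.0412),
    "frame_to_csv": (112.2720, 226.7549),
}


def ratios_and_speedups(timings):
    """Return {name: (ratio, speedup)} with ratio = head/base and speedup = base/head."""
    return {
        name: (head / base, base / head)
        for name, (head, base) in timings.items()
    }


if __name__ == "__main__":
    for name, (ratio, speedup) in ratios_and_speedups(timings).items():
        print(f"{name:<20} ratio={ratio:.4f}  speedup={speedup:.1f}x")
```

Read this way, a ratio of 0.0849 means the 100k-row benchmark runs a bit under 12 times faster than it did on the baseline.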
    

    So throughput for a single dtype (e.g. floats) on a frame that is not too wide is about 20M rows/min. Here is your example from above:

    In [12]: df = pd.DataFrame({'A' : np.array(np.arange(45000000),dtype='float64')}) 
    In [13]: df['B'] = df['A'] + 1.0   
    In [14]: df['C'] = df['A'] + 2.0
    In [15]: df['D'] = df['A'] + 2.0
    In [16]: %timeit -n 1 -r 1 df.to_csv('test.csv')
    1 loops, best of 1: 119 s per loop
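
For reference, 45M rows in 119 s works out to roughly 22.7M rows/min. If you want to check the throughput on your own machine without IPython, something like the following sketch should work. It builds the same four float64 columns at a smaller size; the function name `time_to_csv`, the row count and the temporary file location are my own choices, not part of the session above, and the number you get will depend on your disk and pandas version.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_to_csv(n_rows: int) -> float:
    """Write an n_rows x 4 float64 frame to a temporary CSV; return rows per minute."""
    a = np.arange(n_rows, dtype="float64")
    # Same column layout as the session above (D repeats the + 2.0 offset).
    df = pd.DataFrame({"A": a, "B": a + 1.0, "C": a + 2.0, "D": a + 2.0})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.csv")
        start = time.perf_counter()
        df.to_csv(path)
        elapsed = time.perf_counter() - start
    return n_rows / elapsed * 60.0


if __name__ == "__main__":
    # 200k rows keeps the run short; increase it for a steadier measurement.
    rate = time_to_csv(200_000)
    print(f"{rate / 1e6:.1f}M rows/min")
```

A single small run like this is noisy, so treat the result as a rough estimate rather than a benchmark.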
    
