Fastest way to parse large CSV files in Pandas

Backend · Unresolved · 3 answers · 1125 views
悲哀的现实 2020-12-09 08:22

I am using pandas to analyse the large data files here: http://www.nielda.co.uk/betfair/data/ They are around 100 megs in size.

Each load from CSV takes a few seconds.

3 Answers
  •  难免孤独
    2020-12-09 08:58

    As @chrisb said, pandas' read_csv is probably faster than csv.reader/numpy.genfromtxt/loadtxt. I don't think you will find anything better for parsing the CSV (as a note, read_csv is not a 'pure python' solution, as its CSV parser is implemented in C).
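
    A minimal sketch of that default C parser in action (the column names and inline sample data here are made up for illustration; the real Betfair files are ~100 MB):

    ```python
    import io
    import pandas as pd

    # Tiny inline sample standing in for one of the large CSV files.
    csv_data = io.StringIO(
        "market_id,selection,odds\n"
        "1.23,HorseA,2.5\n"
        "1.23,HorseB,4.0\n"
    )

    # read_csv uses its C parser by default; engine="c" just makes that explicit.
    df = pd.read_csv(csv_data, engine="c")
    print(df.shape)  # (2, 3)
    ```

    On a real file you would pass the path instead of a StringIO object; the parsing speed comes from the C engine either way.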

    But, if you have to load/query the data often, a solution would be to parse the CSV only once and then store it in another format, e.g. HDF5. You can use pandas (with PyTables in the background) to query that efficiently (docs).
    See here for a comparison of the io performance of HDF5, csv and SQL with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
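
    A sketch of the parse-once-then-query idea, assuming PyTables is installed (the file name, key, and column names are hypothetical):

    ```python
    import pandas as pd

    # Stand-in for a DataFrame parsed once from the large CSV.
    df = pd.DataFrame({"a": range(5), "b": list("vwxyz")})

    try:
        import tables  # PyTables backs pandas' HDF5 support

        # format="table" plus data_columns makes column "a" queryable on disk.
        df.to_hdf("data.h5", key="bets", mode="w",
                  format="table", data_columns=["a"])

        # Later loads read only the matching rows instead of re-parsing the CSV.
        result = pd.read_hdf("data.h5", "bets", where="a > 2")
        n_rows = len(result)
    except ImportError:
        n_rows = None  # PyTables not installed; fall back to plain read_csv
    ```

    The one-time conversion cost is quickly amortised if you query the same data repeatedly.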

    And a possibly relevant related question: "Large data" work flows using pandas
