Fastest way to parse large CSV files in Pandas

Asked by 悲哀的现实, 2020-12-09 08:22

I am using pandas to analyse the large data files here: http://www.nielda.co.uk/betfair/data/ They are around 100 megs in size.

Each load from CSV takes a few seconds.

3 Answers
  • 2020-12-09 08:58

    As @chrisb said, pandas' read_csv is probably faster than csv.reader/numpy.genfromtxt/loadtxt. I don't think you will find anything better for parsing the CSV (as a note, read_csv is not a 'pure Python' solution, since its CSV parser is implemented in C).

    But if you have to load/query the data often, a solution would be to parse the CSV only once and then store it in another format, e.g. HDF5. You can use pandas (with PyTables under the hood) to query that efficiently (docs).
    See here for a comparison of the io performance of HDF5, csv and SQL with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
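
    As a minimal sketch (the file name, HDF5 key, and the query used are placeholders, and PyTables must be installed):

    import pandas as pd

    # One-time cost: parse the CSV and store it as HDF5 (format="table"
    # makes the store queryable later).
    df = pd.read_csv("betfair.csv")
    df.to_hdf("betfair.h5", key="bets", mode="w", format="table")

    # Subsequent loads read the binary store instead of re-parsing text,
    # and can select a subset without loading everything into memory.
    df = pd.read_hdf("betfair.h5", key="bets")
    subset = pd.read_hdf("betfair.h5", key="bets", where="index < 1000")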

    And another possibly relevant question: "Large data" work flows using pandas

  • 2020-12-09 09:05

    One thing to check is the actual performance of the disk system itself. Especially if you use spinning disks (not SSD), your practical disk read speed may be one of the explaining factors for the performance. So, before doing too much optimization, check if reading the same data into memory (by, e.g., mydata = open('myfile.txt').read()) takes an equivalent amount of time. (Just make sure you do not get bitten by disk caches; if you load the same data twice, the second time it will be much faster because the data is already in RAM cache.)
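
    A rough way to check this (the file name is a placeholder; use a freshly booted machine or a file not yet read to avoid the RAM-cache effect mentioned above):

    import time
    import pandas as pd

    path = "betfair.csv"  # placeholder for one of your data files

    # Raw disk read: roughly the lower bound imposed by I/O alone.
    t0 = time.time()
    raw = open(path, "rb").read()
    print("raw read: %.2f s" % (time.time() - t0))

    # Full parse with pandas, for comparison against the raw read.
    t0 = time.time()
    df = pd.read_csv(path)
    print("read_csv: %.2f s" % (time.time() - t0))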

    See the update below before believing what I write underneath

    If your problem is really the parsing of the files, then I am not sure whether any pure Python solution will help you. Since you know the actual structure of the files, you do not need to use a generic CSV parser.

    There are three things to try, though:

    1. The Python csv module and csv.reader
    2. NumPy genfromtxt
    3. NumPy loadtxt

    The third one is probably fastest if you can use it with your data. At the same time it has the most limited set of features. (Which actually may make it fast.)
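
    A rough sketch of all three, assuming a purely numeric CSV with a single header row (the file name and delimiter are placeholders):

    import csv
    import numpy as np

    path = "betfair.csv"  # placeholder for one of your files

    # 1. csv.reader: flexible pure-Python parsing, usually the slowest.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)                    # skip the header row
        rows = [row for row in reader]  # list of lists of strings

    # 2. numpy.genfromtxt: handles missing values, slower than loadtxt.
    data = np.genfromtxt(path, delimiter=",", skip_header=1)

    # 3. numpy.loadtxt: the fastest of the three, but every field must
    #    parse as the same numeric dtype (no missing or mixed columns).
    data = np.loadtxt(path, delimiter=",", skiprows=1)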

    Also, the suggestions given you in the comments by crclayton, BKay, and EdChum are good ones.

    Try the different alternatives! If they do not work, then you will have to write something in a compiled language (either compiled Python or, e.g., C).

    Update: I do believe what chrisb says below, i.e. the pandas parser is fast.

    Then the only way to make the parsing faster is to write an application-specific parser in C (or another compiled language). Generic parsing of CSV files is not straightforward, but if the exact structure of the file is known, there may be shortcuts. In any case, parsing text files is slow, so if you can ever translate the data into something more palatable (HDF5, a NumPy array), loading will be limited only by I/O performance.
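
    A minimal sketch of the NumPy-array route (file names are placeholders, and it assumes the columns of interest are numeric):

    import numpy as np
    import pandas as pd

    # Parse the text once (slow), then cache the values in binary form.
    df = pd.read_csv("betfair.csv")
    np.save("betfair.npy", df.to_numpy())

    # Later loads skip text parsing entirely and are bounded by disk I/O.
    values = np.load("betfair.npy")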

  • 2020-12-09 09:22

    Modin is an early-stage project from UC Berkeley’s RISELab designed to facilitate the use of distributed computing for data science. It is a multiprocess DataFrame library with an API identical to pandas that allows users to speed up their pandas workflows. Modin accelerates pandas queries by about 4x on an 8-core machine, requiring users to change only a single line of code in their notebooks.

    pip install modin
    

    If using the Dask backend:

    pip install modin[dask]
    

    Then import Modin by typing:

    import modin.pandas as pd
    

    It uses all CPU cores to import the CSV file, and the API is almost identical to pandas.
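
    For example (the file name is a placeholder):

    import modin.pandas as pd

    # read_csv is dispatched across all cores; the call itself is unchanged.
    df = pd.read_csv("betfair.csv")
    print(df.head())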
