Fastest way to parse large CSV files in Pandas

Asked by 悲哀的现实, 2020-12-09 08:22

I am using pandas to analyse the large data files here: http://www.nielda.co.uk/betfair/data/ They are around 100 megs in size.

Each load from CSV takes a few seconds.

3 Answers
  • 2020-12-09 08:58

    As @chrisb said, pandas' read_csv is probably faster than csv.reader/numpy.genfromtxt/loadtxt. I don't think you will find anything better for parsing the CSV (as a note, read_csv is not a 'pure Python' solution, since its CSV parser is implemented in C).

    But if you have to load/query the data often, a solution would be to parse the CSV only once and then store it in another format, e.g. HDF5. You can use pandas (with PyTables under the hood) to query that efficiently (docs).
    See here for a comparison of the io performance of HDF5, csv and SQL with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
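
    As a minimal sketch (the file name, HDF5 key, and the query used are placeholders, and PyTables must be installed):

    import pandas as pd

    # One-time cost: parse the CSV and store it as HDF5 (format="table"
    # makes the store queryable later).
    df = pd.read_csv("betfair.csv")
    df.to_hdf("betfair.h5", key="bets", mode="w", format="table")

    # Subsequent loads read the binary store instead of re-parsing text,
    # and can select a subset without loading everything into memory.
    df = pd.read_hdf("betfair.h5", key="bets")
    subset = pd.read_hdf("betfair.h5", key="bets", where="index < 1000")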

    And another possibly relevant question: "Large data" work flows using pandas

  • 2020-12-09 09:05

    One thing to check is the actual performance of the disk system itself. Especially if you use spinning disks (not SSD), your practical disk read speed may be one of the explaining factors for the performance. So, before doing too much optimization, check if reading the same data into memory (by, e.g., mydata = open('myfile.txt').read()) takes an equivalent amount of time. (Just make sure you do not get bitten by disk caches; if you load the same data twice, the second time it will be much faster because the data is already in RAM cache.)
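
    A rough way to check this (the file name is a placeholder; use a freshly booted machine or a file not yet read to avoid the RAM-cache effect mentioned above):

    import time
    import pandas as pd

    path = "betfair.csv"  # placeholder for one of your data files

    # Raw disk read: roughly the lower bound imposed by I/O alone.
    t0 = time.time()
    raw = open(path, "rb").read()
    print("raw read: %.2f s" % (time.time() - t0))

    # Full parse with pandas, for comparison against the raw read.
    t0 = time.time()
    df = pd.read_csv(path)
    print("read_csv: %.2f s" % (time.time() - t0))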

    See the update below before believing what I write underneath

    If your problem is really the parsing of the files, then I am not sure whether any pure Python solution will help you. Since you know the actual structure of the files, you do not need to use a generic CSV parser.

    There are three things to try, though:

    1. The Python csv module and csv.reader
    2. NumPy genfromtxt
    3. NumPy loadtxt

    The third one is probably fastest if you can use it with your data. At the same time it has the most limited set of features. (Which actually may make it fast.)
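
    A rough sketch of all three, assuming a purely numeric CSV with a single header row (the file name and delimiter are placeholders):

    import csv
    import numpy as np

    path = "betfair.csv"  # placeholder for one of your files

    # 1. csv.reader: flexible pure-Python parsing, usually the slowest.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)                    # skip the header row
        rows = [row for row in reader]  # list of lists of strings

    # 2. numpy.genfromtxt: handles missing values, slower than loadtxt.
    data = np.genfromtxt(path, delimiter=",", skip_header=1)

    # 3. numpy.loadtxt: the fastest of the three, but every field must
    #    parse as the same numeric dtype (no missing or mixed columns).
    data = np.loadtxt(path, delimiter=",", skiprows=1)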

    Also, the suggestions given you in the comments by crclayton, BKay, and EdChum are good ones.

    Try the different alternatives! If they do not work, then you will have to write something in a compiled language (either compiled Python or, e.g., C).

    Update: I do believe what chrisb says below, i.e. the pandas parser is fast.

    Then the only way to make the parsing faster is to write an application-specific parser in C (or another compiled language). Generic parsing of CSV files is not straightforward, but if the exact structure of the file is known, there may be shortcuts. In any case, parsing text files is slow, so if you can ever translate the data into something more palatable (HDF5, a NumPy array), loading will be limited only by I/O performance.
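
    A minimal sketch of the NumPy-array route (file names are placeholders, and it assumes the columns of interest are numeric):

    import numpy as np
    import pandas as pd

    # Parse the text once (slow), then cache the values in binary form.
    df = pd.read_csv("betfair.csv")
    np.save("betfair.npy", df.to_numpy())

    # Later loads skip text parsing entirely and are bounded by disk I/O.
    values = np.load("betfair.npy")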

  • 2020-12-09 09:22

    Modin is an early-stage project from UC Berkeley’s RISELab designed to facilitate the use of distributed computing for data science. It is a multiprocess DataFrame library with an API identical to pandas that allows users to speed up their pandas workflows. Modin accelerates pandas queries by about 4x on an 8-core machine, requiring users to change only a single line of code in their notebooks.

    pip install modin
    

    If using the Dask backend:

    pip install modin[dask]
    

    Then import Modin by typing:

    import modin.pandas as pd
    

    It uses all CPU cores to import the CSV file, and the API is almost identical to pandas.
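
    For example (the file name is a placeholder):

    import modin.pandas as pd

    # read_csv is dispatched across all cores; the call itself is unchanged.
    df = pd.read_csv("betfair.csv")
    print(df.head())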
