dask set_index from large unordered csv file

Submitted by 别来无恙 on 2020-07-18 18:40:27

Question


At the risk of being a bit off-topic, I want to share a simple solution for loading large CSV files into a Dask dataframe in a way that lets you apply the option sorted=True and save a significant amount of processing time.

I found doing set_index within Dask unworkable given the size of the toy cluster I am using for learning and the size of the files (33 GB).

So if your problem is loading large unsorted CSV files (multiple tens of gigabytes) into a Dask dataframe and quickly starting to run groupbys, my suggestion is to sort them beforehand with the Unix command sort.

sort's processing needs are modest and it will not push your RAM usage to unmanageable levels. You can set how many sorts run in parallel as well as the amount of RAM consumed as a buffer. As long as you have disk space, this works very well.
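For reference, a hedged sketch of those knobs on GNU sort (the thread count, buffer size, and scratch path below are placeholders, not values from the original post); the same flags drop into the sort stage of the pipeline shown further down:

# --parallel: how many sorts to run concurrently
# -S: main-memory buffer size before sort spills to temporary files on disk
# -T: directory for those temporary spill files (pick a disk with free space)
sort -t "," -k 1,1 --parallel=4 -S 2G -T /path/to/scratch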

The trick here is to export LC_ALL=C in your environment prior to issuing the command. Otherwise, pandas/Dask sorting and Unix sort will produce different orderings, because Unix sort is locale-aware by default while pandas/Dask compare raw code points.
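A quick way to see the mismatch on a toy input (the sample strings are made up, and the first line assumes the en_US.UTF-8 locale is installed):

printf 'a\nB\n_1\n' | LC_ALL=en_US.UTF-8 sort   # locale-aware collation; ordering differs from byte order
printf 'a\nB\n_1\n' | LC_ALL=C sort             # plain byte order: B, _1, a -- matches pandas/Dask code-point ordering for ASCII keys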

Here is the code I used:

export LC_ALL=C

zcat BigFat.csv.gz |
  fgrep -v "mykey," |    # have headers? take them away (the pattern is a placeholder matching your header row)
  sort -t "," -k 1,1 |   # fancy multi-field sorting/index? add e.g. -k 3,3 -k 4,4
  split -l 10000000      # partitions of 10,000,000 lines each (written as xaa, xab, ...)

The result is then ready for:

import dask.dataframe as dd
ddf = dd.read_csv(...)                        # point this at the files written by split
ddf = ddf.set_index(ddf.mykey, sorted=True)   # sorted=True tells Dask the data is already ordered
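From there you can start grouping on the key right away; a minimal sketch, assuming a value column named amount (a placeholder name, not from the original post):

per_key = ddf.groupby("mykey")["amount"].sum().compute()   # with known divisions, each key lives in a single partition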

Hope this helps

JC


Answer 1:


As discussed above, I am just posting this as the solution I found to my own problem. I hope it works for others.

I am not claiming this is the best, most efficient, or most Pythonic approach! :-)



Source: https://stackoverflow.com/questions/46971219/dask-set-index-from-large-unordered-csv-file
