dask set_index from large unordered csv file

Submitted by 别来无恙 on 2020-07-18 18:40:27

Question


At the risk of being a bit off-topic, I want to share a simple solution for loading large CSV files into a Dask dataframe in a way that lets you apply the option sorted=True and save a significant amount of processing time.

I found doing set_index within Dask unworkable given the size of the toy cluster I am using for learning and the size of the files (33 GB).

So if your problem is loading large unsorted CSV files (multiple tens of gigabytes) into a Dask dataframe and quickly starting to run groupbys, my suggestion is to sort them beforehand with the Unix command sort.

sort's processing needs are modest and it will not push your RAM usage to unmanageable levels. You can set how many sorts run in parallel as well as the amount of RAM consumed as a buffer. As long as you have disk space, this works very well.
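For reference, a hedged sketch of those knobs on GNU sort (the thread count, buffer size, and scratch path below are placeholders, not values from the original post); the same flags drop into the sort stage of the pipeline shown further down:

# --parallel: how many sorts to run concurrently
# -S: main-memory buffer size before sort spills to temporary files on disk
# -T: directory for those temporary spill files (pick a disk with free space)
sort -t "," -k 1,1 --parallel=4 -S 2G -T /path/to/scratch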

The trick here is to export LC_ALL=C in your environment prior to issuing the command. Otherwise, pandas/Dask sorting and Unix sort will produce different orderings, because Unix sort is locale-aware by default while pandas/Dask compare raw code points.
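A quick way to see the mismatch on a toy input (the sample strings are made up, and the first line assumes the en_US.UTF-8 locale is installed):

printf 'a\nB\n_1\n' | LC_ALL=en_US.UTF-8 sort   # locale-aware collation; ordering differs from byte order
printf 'a\nB\n_1\n' | LC_ALL=C sort             # plain byte order: B, _1, a -- matches pandas/Dask code-point ordering for ASCII keys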

Here is the code I used:

export LC_ALL=C

zcat BigFat.csv.gz |
  fgrep -v "mykey," |    # have headers? take them away (the pattern is a placeholder matching your header row)
  sort -t "," -k 1,1 |   # fancy multi-field sorting/index? add e.g. -k 3,3 -k 4,4
  split -l 10000000      # partitions of 10,000,000 lines each (written as xaa, xab, ...)

The result is then ready for:

import dask.dataframe as dd
ddf = dd.read_csv(...)                        # point this at the files written by split
ddf = ddf.set_index(ddf.mykey, sorted=True)   # sorted=True tells Dask the data is already ordered
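From there you can start grouping on the key right away; a minimal sketch, assuming a value column named amount (a placeholder name, not from the original post):

per_key = ddf.groupby("mykey")["amount"].sum().compute()   # with known divisions, each key lives in a single partition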

Hope this helps

JC


Answer 1:


As discussed above, I am just posting this as the solution I found to my own problem. I hope it works for others.

I am not claiming this is the best, most efficient, or most Pythonic approach! :-)



Source: https://stackoverflow.com/questions/46971219/dask-set-index-from-large-unordered-csv-file
