Strategy for partitioning dask dataframes efficiently

Submitted by 心已入冬 on 2019-12-03 03:17:34

Just to add to Samantha Hughes' answer:

By default, memory_usage() does not count the memory held by the values in object dtype columns; it counts only the references to them. For the datasets I have been working with recently, this underestimates memory usage by roughly a factor of 10.
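To see how large the gap is on your own data, you can compare the two estimates directly. The following is a minimal sketch, not part of the original answer: the helper name deep_to_shallow_ratio is hypothetical, and frame is assumed to be a pandas DataFrame, for example a single partition pulled out of a dask dataframe.

```python
def deep_to_shallow_ratio(frame):
    """Return how many times larger the deep estimate is than the default one.

    `frame` is assumed to be a pandas DataFrame (e.g. one computed partition).
    """
    # Default estimate: object columns contribute only their references.
    shallow = int(frame.memory_usage(deep=False).sum())
    # Deep estimate: the Python objects behind those references are counted too.
    deep = int(frame.memory_usage(deep=True).sum())
    return deep / shallow
```

A ratio close to 1 suggests the frame has few or no object dtype columns; a ratio around 10 would match the underestimate described above.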

Unless you are sure there are no object dtype columns, I would suggest specifying deep=True, that is, repartitioning with:

df.repartition(npartitions=1 + df.memory_usage(deep=True).sum().compute() // n)

Here n is your target partition size in bytes. Adding 1 ensures the number of partitions is always at least 1, because floor division (//) returns 0 when the data is smaller than n.
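The arithmetic can be wrapped in a small helper so the partition count is easy to check on its own. This is a minimal sketch rather than anything from the original answers: the function names target_npartitions and repartition_by_bytes are hypothetical, df is assumed to be a dask DataFrame, and the 100,000,000-byte default is only an illustration of a 100MB target.

```python
def target_npartitions(total_bytes, target_bytes):
    """Number of partitions so that each holds at most about target_bytes.

    Floor division can yield 0 for small inputs, so 1 is added to
    guarantee at least one partition.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    return 1 + int(total_bytes) // int(target_bytes)


def repartition_by_bytes(df, target_bytes=100_000_000):
    """Repartition a dask DataFrame (assumed) to roughly target_bytes each."""
    # deep=True also counts the Python objects held in object dtype columns.
    total_bytes = df.memory_usage(deep=True).sum().compute()
    return df.repartition(npartitions=target_npartitions(total_bytes, target_bytes))
```

For example, 1 GB of data with a 100 MB target gives 11 partitions, so each partition comes out slightly under the target rather than over it.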

After discussion with mrocklin, a decent strategy for partitioning is to aim for partition sizes of about 100MB, guided by df.memory_usage().sum().compute(). For datasets that fit in RAM, the additional work this involves can be mitigated by calling df.persist() at relevant points.

Wes Roach

As of Dask 2.0.0 you may call .repartition(partition_size="100MB").

This method measures partition sizes with object columns taken into account (.memory_usage(deep=True)). It merges partitions that are too small and splits partitions that have grown too large.
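A thin wrapper makes the target size explicit at the call site. This is only a sketch: the function name repartition_to_size is hypothetical, ddf is assumed to be a dask DataFrame, and Dask 2.0.0 or later is assumed to be installed.

```python
def repartition_to_size(ddf, partition_size="100MB"):
    """Ask dask to choose partition boundaries for a target size.

    Assumes `ddf` is a dask DataFrame and Dask >= 2.0.0.
    """
    # Dask has to measure the existing partitions to do this, so the call
    # may trigger computation on the underlying data.
    return ddf.repartition(partition_size=partition_size)
```

Calling ddf = repartition_to_size(ddf) and then inspecting ddf.npartitions shows how many partitions dask settled on. Because the sizes have to be measured first, this can be expensive on data that has not been persisted.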

Dask's documentation also outlines the usage.
