Strategy for partitioning dask dataframes efficiently

臣服心动 2020-12-28 16:32

The Dask documentation talks about repartitioning to reduce overhead here.

They however seem to indicate you need some knowledge of what your dataframe will look l

3 answers
  •  梦谈多话
    2020-12-28 17:08

    Just to add to Samantha Hughes' answer:

    memory_usage() by default ignores memory consumption of object dtype columns. For the datasets I have been working with recently this leads to an underestimate of memory usage of about 10x.

    Unless you are sure there are no object dtype columns I would suggest specifying deep=True, that is, repartition using:

    df.repartition(npartitions=1 + df.memory_usage(deep=True).sum().compute() // n)

    Where n is your target partition size in bytes. Adding 1 ensures the number of partitions is always at least 1, even when the dataframe is smaller than n (// performs floor division).
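
The one-liner above can be wrapped in a small helper, which also makes the floor-division arithmetic easy to check on its own. This is only a sketch: the names partitions_for and repartition_by_size are made up for illustration, and ddf is assumed to be a Dask DataFrame that is passed in by the caller.

```python
def partitions_for(total_bytes: int, target_bytes: int) -> int:
    """Partition count so that each partition holds roughly target_bytes."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    # Floor division plus 1, so even an empty frame gets one partition.
    return 1 + total_bytes // target_bytes


def repartition_by_size(ddf, target_bytes: int):
    """Repartition a Dask DataFrame toward a target partition size in bytes.

    Hypothetical helper; ddf is assumed to be a dask.dataframe.DataFrame.
    """
    # deep=True also counts the contents of object dtype columns.
    total_bytes = int(ddf.memory_usage(deep=True).sum().compute())
    return ddf.repartition(npartitions=partitions_for(total_bytes, target_bytes))
```

For example, ddf = repartition_by_size(ddf, 100 * 1024**2) would aim for partitions of about 100 MiB. Note that computing the memory usage has to evaluate the dataframe once, so on a large dataset this step is not free.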
