Strategy for partitioning dask dataframes efficiently

臣服心动 2020-12-28 16:32

The documentation for Dask talks about repartitioning to reduce overhead here.

They, however, seem to indicate that you need some knowledge of what your dataframe will look like beforehand.
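For context, here is a minimal sketch of the pattern the docs describe; the file path, column name, and filter are placeholders, and the 100x factor assumes you already know the filter keeps roughly 1% of the rows:

```python
import dask.dataframe as dd

# Placeholder input: any multi-partition dataframe works here.
df = dd.read_parquet("events/*.parquet")

# A selective filter leaves many near-empty partitions behind...
small = df[df["value"] > 0]

# ...so collapse them. The //100 assumes prior knowledge that the
# filtered data is about 1/100th the size of the original.
small = small.repartition(npartitions=max(1, small.npartitions // 100))
```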

3 Answers
  •  半阙折子戏 2020-12-28 16:57

    After discussion with mrocklin, a decent strategy is to aim for 100MB partition sizes, guided by df.memory_usage().sum().compute(). With datasets that fit in RAM, the additional work this might involve can be mitigated by calling df.persist() at relevant points.
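    One way to put this strategy into code is sketched below; the input path is a placeholder, and deep=True is an optional extra for a more accurate size estimate on object columns:

```python
import dask.dataframe as dd

df = dd.read_csv("data/*.csv")  # placeholder source

# With data that fits in RAM, persist first so the size measurement
# below does not trigger a full re-read of the source files.
df = df.persist()

# Estimate the dataframe's total in-memory size in bytes.
total_bytes = df.memory_usage(deep=True).sum().compute()

# Aim for roughly 100 MB per partition, per the strategy above.
npartitions = max(1, int(total_bytes / 100e6))
df = df.repartition(npartitions=npartitions).persist()
```

    Newer Dask releases also accept df.repartition(partition_size="100MB"), which automates this size calculation, though check your version's docs since that option has been marked experimental.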
