How to calculate the best numberOfPartitions for coalesce?

没有蜡笔的小新 2020-11-29 04:40

So, I understand that in general one should use coalesce() when:

the number of partitions decreases due to a filter or some …
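
A minimal sketch of that pattern, i.e. shrinking the partition count after a selective filter, runnable in spark-shell (where spark is predefined); the row count, the 1% selectivity and the target of 8 partitions are made-up example values:

    import spark.implicits._

    val ds        = spark.range(0, 100000000L)    // 100M rows spread over many partitions
    val filtered  = ds.filter($"id" % 100 === 0)  // keeps roughly 1% of the rows
    // coalesce only merges existing partitions, so it avoids a full shuffle
    val compacted = filtered.coalesce(8)
    println(compacted.rdd.getNumPartitions)       // => 8 (or fewer if the input had fewer partitions)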

3 Answers
  •  北荒 (OP)
     2020-11-29 05:37

    In practice the optimal number of partitions depends more on the data you have, the transformations you use and the overall configuration than on the available resources.

    • If the number of partitions is too low, you'll experience long GC pauses, different types of memory issues, and lastly suboptimal resource utilization.
    • If the number of partitions is too high, then maintenance cost can easily exceed processing cost. Moreover, if you use non-distributed reducing operations (like reduce in contrast to treeReduce), a large number of partitions results in a higher load on the driver (see the sketch after this list).
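
    A hedged sketch of that driver-side cost, in spark-shell style (sc is the shell's SparkContext); the element count, the 10 000 slices and the depth of 3 are arbitrary example values:

        // reduce pulls one partial result per partition straight back to the driver;
        // treeReduce merges partials on the executors in a few levels first.
        val rdd  = sc.parallelize(1L to 1000000L, numSlices = 10000)
        val flat = rdd.reduce(_ + _)                 // 10 000 partial sums arrive at the driver at once
        val tree = rdd.treeReduce(_ + _, depth = 3)  // partials combined in 3 levels before reaching the driver
        assert(flat == tree)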

    You can find a number of rules which suggest oversubscribing partitions compared to the number of cores (a factor of 2 or 3 seems to be common) or keeping partitions at a certain size, but these rules don't take your own code into account:

    • If your code allocates a lot of memory you can expect long GC pauses, and it is probably better to go with smaller partitions.
    • If a certain piece of code is expensive then your shuffle cost can be amortized by a higher concurrency.
    • If you have a filter, you can adjust the number of partitions based on the discriminative power of the predicate (you make different decisions if you expect to retain 5% of the data than if you expect to retain 99%; see the sketch after this list).
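
    For that last point, a hedged sketch of deriving a post-filter partition count from the expected selectivity, again in spark-shell style; the input path, the status column and the 5% retention estimate are assumptions made up for illustration:

        import spark.implicits._

        // Hypothetical input and predicate; 0.05 is the estimated fraction of rows the filter keeps.
        val df     = spark.read.parquet("/data/events")
        val before = df.rdd.getNumPartitions
        val target = math.max(1, (before * 0.05).toInt)
        val slim   = df.filter($"status" === "ERROR").coalesce(target)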

    In my opinion:

    • With one-off jobs keep a higher number of partitions to stay on the safe side (slower is better than failing).
    • With reusable jobs start with a conservative configuration then execute - monitor - adjust configuration - repeat.
    • Don't try to use a fixed number of partitions based on the number of executors or cores. First understand your data and code, then adjust the configuration to reflect your understanding.

      Usually, it is relatively easy to determine the amount of raw data per partition for which your cluster exhibits stable behavior (in my experience it is somewhere in the range of a few hundred megabytes, depending on the format, the data structure you use to load the data, and the configuration). This is the "magic number" you're looking for; a sketch of putting it to use follows.
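
      A minimal sketch of turning that magic number into a partition count, in spark-shell style; the /data/events path and the 256 MB target are made-up values:

          import org.apache.hadoop.fs.{FileSystem, Path}

          // Size the partition count from raw input bytes divided by a target bytes-per-partition.
          val fs         = FileSystem.get(spark.sparkContext.hadoopConfiguration)
          val inputBytes = fs.getContentSummary(new Path("/data/events")).getLength
          val targetSize = 256L * 1024 * 1024                   // the "magic number" in bytes
          val partitions = math.max(1, (inputBytes / targetSize).toInt)
          val df         = spark.read.parquet("/data/events").repartition(partitions)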

    Some things you have to remember in general:

    • The number of partitions doesn't necessarily reflect the data distribution. Any operation that requires a shuffle (*byKey, join, RDD.partitionBy, Dataset.repartition) can result in a non-uniform data distribution. Always monitor your jobs for symptoms of significant data skew.
    • The number of partitions is in general not constant. Any operation with multiple dependencies (union, coGroup, join) can affect the number of partitions. Both points are illustrated in the sketch after this list.
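
    A hedged spark-shell sketch of both points; the two pair RDDs below are placeholders built only for illustration:

        // Placeholder pair RDDs with only 10 distinct keys.
        val left   = sc.parallelize(1 to 1000000).map(i => (i % 10, i))
        val right  = sc.parallelize(1 to 1000).map(i => (i % 10, i.toString))
        val joined = left.join(right)

        // The multi-dependency operation changes the partition count.
        println(s"left=${left.getNumPartitions} right=${right.getNumPartitions} joined=${joined.getNumPartitions}")

        // Per-partition record counts after the shuffle; with only 10 distinct keys the records
        // concentrate on at most 10 partitions, which is exactly the kind of skew to watch for.
        val sizes = joined.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size))).collect()
        sizes.sortBy(p => -p._2).take(5).foreach { case (idx, n) => println(s"partition $idx: $n records") }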
