How to calculate the best numberOfPartitions for coalesce?

后端未结

关注

 3  799

没有蜡笔的小新 2020-11-29 04:40

So, I understand that in general one should use coalesce() when:

the number of partitions decreases due to a filter or some

3条回答

情深已故 (楼主)

2020-11-29 05:17
Your question is a valid one, but Spark partitioning optimization depends entirely on the computation you're running. You need to have a good reason to repartition/coalesce; if you're just counting an RDD (even if it has a huge number of sparsely populated partitions), then any repartition/coalesce step is just going to slow you down.

Repartition vs coalesce

The difference between repartition(n) (which is the same as coalesce(n, shuffle = true) and coalesce(n, shuffle = false) has to do with execution model. The shuffle model takes each partition in the original RDD, randomly sends its data around to all executors, and results in an RDD with the new (smaller or greater) number of partitions. The no-shuffle model creates a new RDD which loads multiple partitions as one task.

Let's consider this computation:
```
sc.textFile("massive_file.txt")
  .filter(sparseFilterFunction) // leaves only 0.1% of the lines
  .coalesce(numPartitions, shuffle = shuffle)
```
If shuffle is true, then the text file / filter computations happen in a number of tasks given by the defaults in textFile, and the tiny filtered results are shuffled. If shuffle is false, then the number of total tasks is at most numPartitions.

If numPartitions is 1, then the difference is quite stark. The shuffle model will process and filter the data in parallel, then send the 0.1% of filtered results to one executor for downstream DAG operations. The no-shuffle model will process and filter the data all on one core from the beginning.

Steps to take

Consider your downstream operations. If you're just using this dataset once, then you probably don't need to repartition at all. If you are saving the filtered RDD for later use (to disk, for example), then consider the tradeoffs above. It takes experience to become familiar with these models and when one performs better, so try both out and see how they perform!
0 讨论(0)

查看其它3个回答
发布评论:

提交评论
- 加载中...

How to calculate the best numberOfPartitions for coalesce?

Repartition vs coalesce

Steps to take