How to calculate the best numberOfPartitions for coalesce?

Backend · Unresolved · 3 replies · 799 views

没有蜡笔的小新 asked 2020-11-29 04:40

So, I understand that in general one should use coalesce() when:

the number of partitions decreases due to a filter or some

3 Answers
  •  情深已故
    2020-11-29 05:17

    Your question is a valid one, but Spark partitioning optimization depends entirely on the computation you're running. You need to have a good reason to repartition/coalesce; if you're just counting an RDD (even if it has a huge number of sparsely populated partitions), then any repartition/coalesce step is just going to slow you down.

    Repartition vs coalesce

    The difference between repartition(n) (which is the same as coalesce(n, shuffle = true)) and coalesce(n, shuffle = false) has to do with the execution model. The shuffle model takes each partition in the original RDD, sends its data across the executors, and results in an RDD with the new (smaller or greater) number of partitions. The no-shuffle model creates a new RDD in which each task loads several of the original partitions.
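
    A minimal sketch of the two calls (the RDD contents and partition counts here are hypothetical, and a running SparkContext `sc` is assumed):

    val rdd = sc.parallelize(1 to 1000, numSlices = 100) // 100 initial partitions

    // Shuffle model: full shuffle; can increase or decrease the partition count.
    val reshuffled = rdd.repartition(10) // same as coalesce(10, shuffle = true)

    // No-shuffle model: each new task reads roughly 10 of the old partitions.
    val merged = rdd.coalesce(10) // shuffle = false by default

    // Without a shuffle, coalesce cannot increase the count:
    rdd.coalesce(200, shuffle = false).getNumPartitions // still 100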

    Let's consider this computation:

    sc.textFile("massive_file.txt")
      .filter(sparseFilterFunction) // leaves only 0.1% of the lines
      .coalesce(numPartitions, shuffle = shuffle)
    

    If shuffle is true, then the textFile / filter computations run in the number of tasks determined by textFile's defaults, and the tiny filtered results are shuffled. If shuffle is false, then the total number of tasks is at most numPartitions.

    If numPartitions is 1, then the difference is quite stark. The shuffle model will process and filter the data in parallel, then send the 0.1% of filtered results to one executor for downstream DAG operations. The no-shuffle model will process and filter the data all on one core from the beginning.
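
    The numPartitions = 1 case can be sketched like this (the file path and filter predicate are placeholders, not from the original question):

    val lines = sc.textFile("massive_file.txt")
    val kept = lines.filter(_.contains("rare_token")) // ~0.1% of lines survive

    // Shuffle model: read + filter run in parallel across all input partitions,
    // then the small filtered result is shuffled to a single partition.
    val parallelThenOne = kept.repartition(1)

    // No-shuffle model: the whole read + filter pipeline collapses into one
    // task on a single core.
    val oneTaskOnly = kept.coalesce(1)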

    Steps to take

    Consider your downstream operations. If you're just using this dataset once, then you probably don't need to repartition at all. If you are saving the filtered RDD for later use (to disk, for example), then consider the tradeoffs above. It takes experience to become familiar with these models and when one performs better, so try both out and see how they perform!
