Spark coalesce relationship with number of executors and cores

Submitted by 喜夏-厌秋 on 2019-12-06 09:34:09

Question


I'm bringing up a very silly question about Spark because I want to clear up my confusion. I'm very new to Spark and am still trying to understand how it works internally.

Say I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my partition count to 100.

Now I run this job with 12 executors and 5 cores per executor, which means 60 tasks can run at a time. Does that mean each task will work on one single partition independently?

Round 1: 12 executors with 5 cores each => 60 tasks process 60 partitions
Round 2: 8 executors with 5 cores each => 40 tasks

process the remaining 40 partitions, and 4 executors never receive work a second time

Or will all tasks from the same executor work on the same partition?

Round 1: 12 executors => process 12 partitions
Round 2: 12 executors => process 12 partitions
Round 3: 12 executors => process 12 partitions
....
Round 9 (96 partitions already processed): 4 executors => process the remaining 4 partitions


Answer 1:


Say I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my partition count to 100.

In Spark, the number of partitions by default equals the number of HDFS blocks. Since coalesce(100) is specified, Spark will merge the input data into 100 partitions.
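To make the merging concrete, here is a plain-Python sketch (not Spark code) of the idea behind coalesce: existing partitions are grouped into fewer, larger ones without a full shuffle. The grouping strategy below is a simplification; real Spark also considers data locality.

```python
# Illustrative sketch only: coalesce(n) merges existing partitions into
# at most n larger partitions by grouping them, rather than reshuffling
# every record.
def coalesce(partitions, n):
    """Group `partitions` (a list of lists) into at most `n` partitions."""
    n = min(n, len(partitions))
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Assign each input partition to an output group; Spark's actual
        # grouping is locality-aware, but the principle is the same.
        groups[i % n].extend(part)
    return groups

# 1000 input partitions (one per file) reduced to 100:
inputs = [[f"file_{i}"] for i in range(1000)]
out = coalesce(inputs, 100)
print(len(out))     # 100 output partitions
print(len(out[0]))  # 10 input partitions merged into each output partition
```

Because no full shuffle happens, each output partition is simply a union of whole input partitions, which is why coalesce is cheaper than repartition for reducing the partition count.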

Now I run this job with 12 executors and 5 cores per executor, which means 60 tasks can run at a time. Does that mean each of the tasks will work on one single partition independently?

As you might have passed:

--num-executors 12: the number of executors to launch for the application.

--executor-cores 5: the number of cores per executor. 1 core runs 1 task at a time.

So the execution of the partitions will go like this:

Round 1

60 partitions will be processed by 12 executors, running 5 tasks (threads) each.

Round 2

The remaining 40 partitions will be processed by 8 executors, running 5 tasks each; the other 4 executors sit idle.

NOTE: Usually, some executors finish their assigned work quickly (depending on factors like data locality, network I/O, and CPU). Such an executor will pick up the next pending partition, after waiting for the configured amount of scheduling/locality wait time.



Source: https://stackoverflow.com/questions/38465692/spark-coalesce-relationship-with-number-of-executors-and-cores
