How does Spark parallelize slices to tasks/executors/workers?

无人共我 2020-12-29 10:44

I have a 2-node Spark cluster with 4 cores per node.

        MASTER
(Worker-on-master)              (Worker-on-node1)

Spark config:

3 Answers
  •  佛祖请我去吃肉
    2020-12-29 11:17

    I will try to answer your questions as best I can:

    1.- Where can I see task-level details?

    When you submit a job, Spark stores information about the task breakdown on each worker node, in addition to the master. I believe (I have only tested this with Spark on EC2) this data is kept in the work folder under the Spark directory.
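    As a minimal sketch of making those per-task details persistent, you can enable Spark's event log; spark.eventLog.enabled and spark.eventLog.dir are standard Spark properties, while the application name and log directory below are only illustrative:

        import org.apache.spark.{SparkConf, SparkContext}

        // Persist per-task event data so it survives after the application exits.
        val conf = new SparkConf()
          .setAppName("task-detail-demo")                         // illustrative name
          .set("spark.eventLog.enabled", "true")
          .set("spark.eventLog.dir", "file:///tmp/spark-events")  // illustrative path

        val sc = new SparkContext(conf)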

    2.- How to programmatically find the working set size for the map function?

    Although I am not sure whether Spark records the in-memory size of the slices, the logs mentioned in the answer to the first question report how many lines each RDD partition contains.
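    If a programmatic answer helps, here is a minimal sketch (the RDD contents and partition count are illustrative) that counts the records in each partition without collecting the data itself:

        import org.apache.spark.{SparkConf, SparkContext}

        val sc  = new SparkContext(new SparkConf().setAppName("partition-sizes").setMaster("local[4]"))
        val rdd = sc.parallelize(1 to 100000, numSlices = 8)

        // Emit one (partitionIndex, recordCount) pair per partition.
        // iter.size consumes the iterator, which is fine since nothing else is emitted.
        val counts = rdd
          .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
          .collect()

        counts.foreach { case (i, n) => println(s"partition $i: $n records") }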

    3.- Are the multiple tasks run by an executor executed sequentially or in parallel across multiple threads?

    I believe different tasks inside a node run sequentially. This is shown in the logs mentioned above, which record the start and end time of every task.
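    One way to check this empirically, as a minimal sketch (the sleep is only there to make any concurrent tasks overlap in time): each task reports the executor thread it ran on and when it started, so several distinct thread names with overlapping start times would indicate parallel execution within one executor:

        import org.apache.spark.{SparkConf, SparkContext}

        val sc  = new SparkContext(new SparkConf().setAppName("thread-check").setMaster("local[4]"))
        val rdd = sc.parallelize(1 to 8, numSlices = 8)

        // Each task reports the thread it ran on and its start time.
        val runs = rdd.mapPartitions { iter =>
          val thread = Thread.currentThread().getName
          val start  = System.currentTimeMillis()
          Thread.sleep(1000) // hold the core briefly so concurrent tasks overlap
          Iterator((thread, start, iter.size))
        }.collect()

        runs.foreach(println)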

    4.- Reasoning behind 2-4 slices per CPU

    Some nodes finish their tasks faster than others. Having more slices than available cores distributes the work more evenly: faster nodes pick up extra tasks while slower nodes are still busy, which avoids a long tail where the whole job waits on the slowest node.
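    A minimal sketch of putting that guideline into practice (the 3x multiplier and the data are illustrative; sc.defaultParallelism usually reflects the total cores available to the application):

        import org.apache.spark.{SparkConf, SparkContext}

        // local[4] is illustrative; point setMaster at your cluster URL in practice.
        val sc = new SparkContext(new SparkConf().setAppName("slice-tuning").setMaster("local[4]"))

        // 2-4 slices per core: oversubscribe so fast nodes pick up extra tasks.
        val slices = sc.defaultParallelism * 3
        val data   = sc.parallelize(1 to 1000000, numSlices = slices)

        println(s"cores (defaultParallelism): ${sc.defaultParallelism}, partitions: ${data.getNumPartitions}")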
