How does Spark parallelize slices to tasks/executors/workers?

无人共我 2020-12-29 10:44

I have a 2-node Spark cluster with 4 cores per node.

        MASTER
(Worker-on-master)              (Worker-on-node1)

Spark config:

3 Answers
  •  佛祖请我去吃肉
    2020-12-29 11:17

    I will try to answer your questions as best I can:

    1.- Where can I see task-level details?

    When you submit a job, Spark stores information about the task breakdown on each worker node, in addition to the master. I believe (I have only tested this with Spark on EC2) this data is kept in the work folder under the Spark directory.
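    As a minimal sketch of making those per-task details persistent, you can enable Spark's event log; spark.eventLog.enabled and spark.eventLog.dir are standard Spark properties, while the application name and log directory below are only illustrative:

        import org.apache.spark.{SparkConf, SparkContext}

        // Persist per-task event data so it survives after the application exits.
        val conf = new SparkConf()
          .setAppName("task-detail-demo")                         // illustrative name
          .set("spark.eventLog.enabled", "true")
          .set("spark.eventLog.dir", "file:///tmp/spark-events")  // illustrative path

        val sc = new SparkContext(conf)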

    2.- How to programmatically find the working set size for the map function?

    Although I am not sure whether Spark records the in-memory size of the slices, the logs mentioned in the answer to the first question report how many lines each RDD partition contains.
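    If a programmatic answer helps, here is a minimal sketch (the RDD contents and partition count are illustrative) that counts the records in each partition without collecting the data itself:

        import org.apache.spark.{SparkConf, SparkContext}

        val sc  = new SparkContext(new SparkConf().setAppName("partition-sizes").setMaster("local[4]"))
        val rdd = sc.parallelize(1 to 100000, numSlices = 8)

        // Emit one (partitionIndex, recordCount) pair per partition.
        // iter.size consumes the iterator, which is fine since nothing else is emitted.
        val counts = rdd
          .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
          .collect()

        counts.foreach { case (i, n) => println(s"partition $i: $n records") }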

    3.- Are the multiple tasks run by an executor executed sequentially or in parallel across multiple threads?

    I believe different tasks inside a node run sequentially. This is shown in the logs mentioned above, which record the start and end time of every task.
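    One way to check this empirically, as a minimal sketch (the sleep is only there to make any concurrent tasks overlap in time): each task reports the executor thread it ran on and when it started, so several distinct thread names with overlapping start times would indicate parallel execution within one executor:

        import org.apache.spark.{SparkConf, SparkContext}

        val sc  = new SparkContext(new SparkConf().setAppName("thread-check").setMaster("local[4]"))
        val rdd = sc.parallelize(1 to 8, numSlices = 8)

        // Each task reports the thread it ran on and its start time.
        val runs = rdd.mapPartitions { iter =>
          val thread = Thread.currentThread().getName
          val start  = System.currentTimeMillis()
          Thread.sleep(1000) // hold the core briefly so concurrent tasks overlap
          Iterator((thread, start, iter.size))
        }.collect()

        runs.foreach(println)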

    4.- Reasoning behind 2-4 slices per CPU

    Some nodes finish their tasks faster than others. Having more slices than available cores distributes the work more evenly: faster nodes pick up extra tasks while slower nodes are still busy, which avoids a long tail where the whole job waits on the slowest node.
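    A minimal sketch of putting that guideline into practice (the 3x multiplier and the data are illustrative; sc.defaultParallelism usually reflects the total cores available to the application):

        import org.apache.spark.{SparkConf, SparkContext}

        // local[4] is illustrative; point setMaster at your cluster URL in practice.
        val sc = new SparkContext(new SparkConf().setAppName("slice-tuning").setMaster("local[4]"))

        // 2-4 slices per core: oversubscribe so fast nodes pick up extra tasks.
        val slices = sc.defaultParallelism * 3
        val data   = sc.parallelize(1 to 1000000, numSlices = slices)

        println(s"cores (defaultParallelism): ${sc.defaultParallelism}, partitions: ${data.getNumPartitions}")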
