application time processing depending on the number of computing nodes


Question


Maybe this question is a little strange, but I'll try to ask it anyway.

I have a Spark application, and I test it with different numbers of computing nodes (I vary this number from one to four).

All nodes are identical: they have the same CPUs and the same amount of RAM.

All application settings (such as the parallelism level and the number of partitions) are kept constant.

Here are the application processing times depending on the number of computing nodes:

1 node -- 127 minutes

2 nodes -- 71 minutes

3 nodes -- 51 minutes

4 nodes -- 38 minutes

Fitting a curve to these results and extrapolating suggests that the processing time decreases roughly exponentially as the number of nodes increases linearly. So, in the limit, adding more nodes will not significantly reduce the processing time any further...
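To make the extrapolation concrete, here is a minimal sketch (my own addition, not from the original post, assuming NumPy and SciPy are available) that fits the four measurements above to a simple Amdahl-style model T(n) = serial + parallel / n, where "serial" is the portion of the job that does not speed up with more nodes:

```python
# Minimal sketch: fit the measured runtimes to T(n) = serial + parallel / n.
# The model choice (Amdahl-style fixed overhead) is an assumption for illustration.
import numpy as np
from scipy.optimize import curve_fit

nodes = np.array([1, 2, 3, 4], dtype=float)
minutes = np.array([127, 71, 51, 38], dtype=float)

def model(n, serial, parallel):
    return serial + parallel / n

(serial, parallel), _ = curve_fit(model, nodes, minutes)
print(f"serial ~ {serial:.1f} min, parallel ~ {parallel:.1f} min")
```

With these four data points the fitted serial part comes out to roughly ten to fifteen minutes, which is the floor the runtime approaches no matter how many nodes are added; that matches the flattening curve described above.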

Could anyone explain this fact?

Thank you!


Answer 1:


First off, this heavily depends on the type of your job. Is it I/O bound? Then adding more CPUs won't help much. Adding more nodes will help, but the disks will still limit the job's performance.

Secondly, for every node you add, there will be overhead, e.g. executor and task launching, scheduling, and so on. You also have network transfers between the nodes, especially if your job has multiple shuffles.
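As a rough illustration of where shuffles come from (my own example, not part of the original answer), this PySpark sketch contrasts a narrow transformation, which stays inside each partition, with a wide one, which triggers a shuffle and therefore moves data between nodes over the network:

```python
# Illustrative only: reduceByKey is a wide transformation and forces a shuffle;
# mapValues is narrow and never leaves its partition.
from pyspark import SparkContext

sc = SparkContext(appName="shuffle-demo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

mapped = pairs.mapValues(lambda v: v * 10)      # narrow: no network transfer
summed = pairs.reduceByKey(lambda x, y: x + y)  # wide: shuffle across executors

print(mapped.collect())
print(summed.collect())
sc.stop()
```

The more of these shuffle stages a job has, the more the network (rather than the CPUs) dictates how well it scales with additional nodes.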

You can also try to increase the parallelism so that more nodes and more CPUs can actually be taken advantage of. But in general it's difficult to achieve 100% parallelization, especially in a young project like Spark.
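For reference, here is a minimal PySpark sketch of the knobs this usually involves (the parallelism value of 32 and the input path are placeholders I chose for illustration, not recommendations from the answer):

```python
# Placeholder values throughout; tune to your cluster (a common rule of thumb
# is 2-3 tasks per CPU core in the cluster).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("parallelism-demo")
        .set("spark.default.parallelism", "32"))   # placeholder value

sc = SparkContext(conf=conf)

rdd = sc.textFile("hdfs:///path/to/input")   # hypothetical input path
rdd = rdd.repartition(32)                    # match partitions to the parallelism
print(rdd.getNumPartitions())
sc.stop()
```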



Source: https://stackoverflow.com/questions/29520841/application-time-processing-depending-on-the-number-of-computing-nodes
