Question
Maybe this question is a bit strange, but I'll try to ask it anyway.
I have a Spark application and I am testing it on different numbers of computing nodes (varying the count from one to four).
All nodes are identical: they have the same CPUs and the same amount of RAM.
All application settings (such as the parallelism level and the number of partitions) are kept constant.
Here are the processing times depending on the number of computing nodes:
1 node -- 127 minutes
2 nodes -- 71 minutes
3 nodes -- 51 minutes
4 nodes -- 38 minutes
Fitting these results and extrapolating suggests that the processing time decreases roughly exponentially as the number of nodes increases linearly. So, in the limit, adding more nodes will not significantly reduce the processing time...
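For reference, the speedups relative to the single-node run work out to roughly 1.8x, 2.5x, and 3.3x. Below is a minimal sketch (assuming an Amdahl's-law-style model with a fixed serial fraction, which is only one way to read these numbers) that estimates how much of the job does not parallelize:

```scala
object SpeedupFit {
  def main(args: Array[String]): Unit = {
    // Measured wall-clock times from above (minutes).
    val times = Map(1 -> 127.0, 2 -> 71.0, 3 -> 51.0, 4 -> 38.0)
    val t1 = times(1)

    // Observed speedup relative to one node.
    times.toSeq.sortBy(_._1).foreach { case (nodes, t) =>
      println(f"$nodes%d nodes: speedup = ${t1 / t}%.2fx")
    }

    // Rough Amdahl's-law fit, T(n) = T1 * (s + (1 - s) / n),
    // solving for the serial fraction s from the 4-node measurement.
    val n = 4
    val s = (times(n) / t1 - 1.0 / n) / (1.0 - 1.0 / n)
    println(f"estimated serial fraction s = ${s * 100}%.1f%%")
    // s comes out around 7%, which caps the achievable speedup
    // at roughly 1/s (about 15x), no matter how many nodes are added.
  }
}
```

A fixed serial or overhead fraction like this would produce exactly the pattern of diminishing returns seen in the measurements.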
Could anyone explain this fact?
Thank You!
Answer 1:
First off, this heavily depends on the type of your job. Is it I/O bound? Then adding more CPUs won't help much. Adding more nodes will help, but still, the disks are limiting the performance of the job.
Secondly, for every node you add, there will be overhead, e.g. executor and task launching, scheduling, and so on. You also have network transfers between the nodes, especially if your job has multiple shuffles.
You can also try to increase parallelism so more nodes and more CPUs can actually be taken advantage of. But in general it's difficult to achieve 100% parallelization, especially in a young project like Spark.
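As a rough illustration of raising parallelism with the RDD API, here is a minimal sketch; the value of 64, the input/output paths, and the word-count job itself are hypothetical placeholders, not taken from the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical setting: aim for a few tasks per CPU core across the cluster.
    val conf = new SparkConf()
      .setAppName("parallelism-example")
      .set("spark.default.parallelism", "64") // default partition count for shuffles

    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///path/to/input") // hypothetical path

    // Explicitly repartition if the input produced too few partitions
    // for the additional nodes to do useful work.
    val repartitioned = data.repartition(64)

    val counts = repartitioned
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _, numPartitions = 64) // shuffle parallelism can also be set per operation

    counts.saveAsTextFile("hdfs:///path/to/output") // hypothetical path
    sc.stop()
  }
}
```

Whether this helps depends on whether the job actually has enough partitions and enough non-serial work to keep all the added cores busy.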
Source: https://stackoverflow.com/questions/29520841/application-time-processing-depending-on-the-number-of-computing-nodes