Performance bottleneck of Spark

Posted by China☆狼群 on 2019-12-11 12:15:13

Question


The paper "Making Sense of Performance in Data Analytics Frameworks", published at NSDI 2015, concludes that CPU (not I/O or network) is the performance bottleneck of Spark. In the paper, Kay ran experiments on Spark using BDbench, TPC-DS, and a production workload (was only Spark SQL used?). I wonder whether this conclusion also holds for frameworks built on Spark, such as Streaming, where a continuous data stream is received over the network and both network I/O and disk come under high pressure.


Answer 1:


Network and disk may be under less pressure in Spark Streaming than you expect, because the streams are usually checkpointed, so not all of the data is kept around forever.

But ultimately this is a research question: the only way to settle it is to benchmark. Kay's code is open source.




Answer 2:


It really depends on the job you execute. You need to analyze the job you write and see where the pressure and the bottlenecks are. For instance, I recently had a job whose workers didn't have enough memory, so it had to spill to disk, which increased its overall I/O by a lot. Once I fixed the memory problem, CPU became the next problem. Tighter code then moved the bottleneck to I/O, and so on.
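As a rough illustration of that kind of analysis, here is a minimal Python sketch that takes per-task measurements and guesses where the pressure is. Everything in it is an assumption made for illustration: the `TaskSample` fields are hypothetical names loosely modeled on the sort of per-task numbers Spark reports (run time, CPU time, GC time, shuffle fetch wait, bytes spilled to disk), and the rule "any spill means memory comes first" is just a simple heuristic that mirrors the order described above, not an established method.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """Hypothetical per-task measurements (field names are illustrative)."""
    run_ms: float          # wall-clock time the task spent running
    cpu_ms: float          # time the task actually spent on the CPU
    gc_ms: float           # time spent in JVM garbage collection
    fetch_wait_ms: float   # time blocked waiting for shuffle data over the network
    disk_spill_bytes: int  # bytes spilled to disk because memory ran out


def summarize(tasks):
    """Aggregate task samples into fractions of total run time."""
    total_run = sum(t.run_ms for t in tasks)
    if total_run == 0:
        # No measured run time: report zeros instead of dividing by zero.
        return {"cpu_fraction": 0.0, "gc_fraction": 0.0,
                "network_wait_fraction": 0.0, "disk_spill_bytes": 0}
    return {
        "cpu_fraction": sum(t.cpu_ms for t in tasks) / total_run,
        "gc_fraction": sum(t.gc_ms for t in tasks) / total_run,
        "network_wait_fraction": sum(t.fetch_wait_ms for t in tasks) / total_run,
        "disk_spill_bytes": sum(t.disk_spill_bytes for t in tasks),
    }


def likely_bottleneck(summary):
    """Assumed heuristic: spilling points at memory, otherwise the largest fraction wins."""
    if summary["disk_spill_bytes"] > 0:
        return "memory"
    fractions = {
        "cpu": summary["cpu_fraction"],
        "gc": summary["gc_fraction"],
        "network": summary["network_wait_fraction"],
    }
    return max(fractions, key=fractions.get)


if __name__ == "__main__":
    # Made-up numbers, only to show how the functions are called.
    before = [TaskSample(1000, 300, 100, 50, 10**9)]
    after = [TaskSample(1000, 900, 50, 20, 0)]
    print(likely_bottleneck(summarize(before)))
    print(likely_bottleneck(summarize(after)))
```

The point is only that the answer changes as you fix each problem in turn, so the measurement has to be repeated after every change.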



Source: https://stackoverflow.com/questions/30254668/performance-bottleneck-of-spark
