Question
A paper, "Making Sense of Performance in Data Analytics Frameworks," published in NSDI 2015, concludes that CPU (not I/O or network) is the performance bottleneck of Spark. In the paper, Kay ran experiments on Spark covering BDBench, TPC-DS, and a production workload (only Spark SQL was used?). I wonder whether this conclusion also holds for frameworks built on top of Spark (like Spark Streaming, where a continuous data stream arrives over the network, putting heavy pressure on both network I/O and disk).
Answer 1:
Network and disk may come under less pressure in Spark Streaming because the streams are usually checkpointed, meaning that not all data is kept around forever.
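As a rough illustration of the checkpointing mentioned above, here is a minimal sketch using the legacy `pyspark.streaming` DStream API; the checkpoint directory and batch interval are made-up values, not anything from the question:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "CheckpointSketch")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches (illustrative)

# Periodically persist DStream metadata and state to fault-tolerant
# storage, so old batch data need not be retained indefinitely.
ssc.checkpoint("hdfs://namenode:9000/spark/checkpoints")  # hypothetical path
```

Whether checkpointing actually relieves network/disk pressure for a given workload still has to be measured, as the answer notes.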
But ultimately, this is a research question: the only way to settle it is to benchmark. Kay's code is open source.
Answer 2:
It really depends on the job you execute. You will need to analyze the job you write and see where the pressure and bottlenecks are. For instance, I recently had a job whose workers didn't have enough memory, so it had to spill to disk, which greatly increased its overall I/O. When I fixed the memory problem, CPU became the next bottleneck; tighter code moved the problem back to I/O, and so on.
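One common way to address the kind of memory-induced spill described above is to give executors more memory or tune the unified memory fraction. A minimal configuration sketch, assuming Spark 1.6+'s unified memory manager; the values shown are illustrative, not recommendations:

```python
from pyspark import SparkConf

conf = (
    SparkConf()
    # More executor memory leaves more room for execution,
    # reducing the need to spill shuffle/aggregation data to disk.
    .set("spark.executor.memory", "4g")
    # Fraction of heap used for execution + storage (0.6 is the default).
    .set("spark.memory.fraction", "0.6")
)
```

After changing these, the Spark UI's stage pages (the "Shuffle Spill (Disk)" column) are the place to confirm whether spilling actually went away.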
Source: https://stackoverflow.com/questions/30254668/performance-bottleneck-of-spark