Spark on EMR: processing time didn't decrease when the number of nodes increased


Question


My Spark program takes a large number of zip files containing JSON data from S3. It performs some cleaning on the data in the form of Spark transformations, and afterwards saves the result as Parquet files. When I run the program on 1 GB of data with a 10-node, 8 GB-per-node configuration on AWS, it takes about 11 minutes. I changed to a 20-node, 32 GB configuration and it still takes about 10 minutes; the runtime dropped by only about 1 minute. Why do I see this behavior?
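
For reference, a minimal sketch of the pipeline described above (the question includes no code, so the bucket paths and cleaning steps here are hypothetical placeholders), runnable in spark-shell:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CleanJsonToParquet").getOrCreate()

    // Note: plain .zip archives are not a format Spark reads natively;
    // gzip-compressed JSON (.json.gz) is decompressed transparently by
    // spark.read.json, so that is what this sketch assumes.
    val raw = spark.read.json("s3://my-bucket/input/*.json.gz")

    // Placeholder cleaning steps standing in for the transformations
    // the question describes.
    val cleaned = raw.na.drop().dropDuplicates()

    cleaned.write.mode("overwrite").parquet("s3://my-bucket/output/")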


Answer 1:


Because adding more machines isn't always the solution. Adding machines can introduce extra data transfer over the network (for example during shuffles), and that network traffic is often the bottleneck.

Also, 1 GB of data isn't large enough for meaningful scalability and performance benchmarking.
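
To make this concrete: with an input this small, the effective parallelism is bounded by the number of partitions rather than the number of nodes, so most of the extra executors sit idle. A sketch of how one might inspect this (not part of the original answer; the path and partition count are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ParallelismCheck").getOrCreate()
    val df = spark.read.json("s3://my-bucket/input/")

    // With ~1 GB of input, default partitioning may yield only a handful
    // of partitions, so most executors on a 20-node cluster stay idle.
    println(s"partitions: ${df.rdd.getNumPartitions}")

    // Repartitioning spreads the work across more tasks, but the shuffle it
    // triggers moves data over the network, which is exactly the overhead
    // described above.
    val spread = df.repartition(160)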



Source: https://stackoverflow.com/questions/35987974/spark-on-emr-time-for-running-data-in-emr-didnt-reduce-when-no-of-nodes-incre
