I\'m running an EMR cluster (version emr-4.2.0) for Spark using the Amazon specific maximizeResourceAllocation flag as documented here. According to those docs, \"
Okay, after a lot of experimentation, I was able to track down the problem. I'm going to report my findings here to help people avoid frustration in the future.
maximizeResourceAllocation is set, when you run a Spark program, it sets the property spark.default.parallelism to be the number of instance cores (or "vCPUs") for all the non-master instances that were in the cluster at the time of creation. This is probably too small even in normal cases; I've heard that it is recommended to set this at 4x the number of cores you will have to run your jobs. This will help make sure that there are enough tasks available during any given stage to keep the CPUs busy on all executors.spark.default.parallelism setting at runtime, this can be a convenient number to repartition to.TL;DR
maximizeResourceAllocation will do almost everything for you correctly except...spark.default.parallelism to 4x number of instance cores you want the job to run on on a per "step" (in EMR speak)/"application" (in YARN speak) basis, i.e. set it every time and...