I'm writing to see if anyone knows how to speed up S3 write times from Spark running in EMR?
My Spark job takes over 4 hours to complete; however, the cluster is onl
The direct committer was pulled from Spark because it wasn't resilient to failures. I would strongly advise against using it.
There is ongoing work in Hadoop (under S3Guard) to add zero-rename committers, which will be O(1) and fault tolerant; keep an eye on HADOOP-13786.
Ignoring "the Magic committer" for now, the Netflix-based staging committer will ship first (hadoop 2.9? 3.0?)
Result: task commit takes data/bandwidth seconds, while job commit takes no longer than the time to do 1-4 GETs on the destination folder and a POST for every pending file, the latter being parallelized.
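To put rough, purely illustrative numbers on that (the output size and bandwidth below are assumptions, not figures from the committer work), here's a back-of-the-envelope sketch in Scala:

    // Hypothetical task output size and effective upload bandwidth to S3.
    val bytesWritten    = 2L * 1024 * 1024 * 1024   // ~2 GB written by one task
    val uploadBandwidth = 100L * 1024 * 1024        // ~100 MB/s to S3

    // Task commit cost is roughly data / bandwidth.
    val taskCommitSeconds = bytesWritten.toDouble / uploadBandwidth  // ≈ 20 s

    // Job commit, by contrast, doesn't scale with data volume: a handful of
    // GETs on the destination plus one POST per pending upload, in parallel.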
You can pick up the committer this work is based on from Netflix and probably use it in Spark today. Do set the file commit algorithm to 1 (that should be the default), or it won't actually write the data.
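As a minimal sketch of that one setting from the Spark side (the app name, bucket and path are placeholders; wiring in the Netflix committer class itself is specific to that library and not shown here):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3a-commit-example")   // placeholder name
      // File commit algorithm v1: tasks commit into the job attempt directory
      // and job commit promotes it to the destination. Per the answer above,
      // this must be 1 or the committer won't actually write the data.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
      .getOrCreate()

    // Placeholder write through the s3a connector.
    spark.range(1000L).write.parquet("s3a://my-bucket/output/")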