Extremely slow S3 write times from EMR/Spark

梦如初夏 2020-12-23 12:20

I'm writing to see if anyone knows how to speed up S3 write times from Spark running in EMR?

My Spark Job takes over 4 hours to complete, however the cluster is onl

6 Answers
  •  心在旅途
    2020-12-23 12:47

    The direct committer was pulled from Spark as it wasn't resilient to failures. I would strongly advise against using it.

    There is ongoing work in Hadoop, under S3Guard, to add zero-rename committers, which will be O(1) and fault tolerant; keep an eye on HADOOP-13786.

    Ignoring "the Magic committer" for now, the Netflix-based staging committer will ship first (hadoop 2.9? 3.0?)

    1. It writes the work to the local FS in task commit.
    2. It issues uncommitted multipart PUT operations to write the data, but does not materialize it.
    3. It saves the information needed to commit the PUTs to HDFS, using the original "algorithm 1" file output committer.
    4. It implements a job commit which uses the file output commit of HDFS to decide which PUTs to complete and which to cancel.

    Result: task commit takes data/bandwidth seconds, but job commit takes no longer than the time to do 1-4 GETs on the destination folder and a POST for every pending file, the latter being parallelized. The deferred multipart-upload pattern behind this is sketched below.
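    To make the mechanism concrete, here is a minimal sketch of that deferred multipart-upload pattern in Scala against the AWS SDK for Java v1. This is my illustration, not the Netflix committer's actual code: the object name, the single-part upload, and returning the commit state to the caller (instead of persisting it to HDFS) are all simplifications.

    ```scala
    import java.io.File
    import scala.collection.JavaConverters._
    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.model._

    object DeferredCommitSketch {
      private val s3 = AmazonS3ClientBuilder.defaultClient()

      // Task commit: upload the bytes, but do NOT complete the multipart
      // upload, so nothing becomes visible at the destination key yet.
      // A real committer would persist the returned state to HDFS.
      def stagePut(bucket: String, key: String, data: File): (String, Seq[PartETag]) = {
        val uploadId = s3.initiateMultipartUpload(
          new InitiateMultipartUploadRequest(bucket, key)).getUploadId
        val etag = s3.uploadPart(new UploadPartRequest()
          .withBucketName(bucket).withKey(key).withUploadId(uploadId)
          .withPartNumber(1)               // one part only, for brevity
          .withFile(data).withPartSize(data.length)).getPartETag
        (uploadId, Seq(etag))
      }

      // Job commit: the POST that materializes the object (fast, no copy).
      def completePut(bucket: String, key: String,
                      uploadId: String, etags: Seq[PartETag]): Unit =
        s3.completeMultipartUpload(
          new CompleteMultipartUploadRequest(bucket, key, uploadId, etags.asJava))

      // Job abort: cancel the pending upload of a failed/speculative task.
      def abortPut(bucket: String, key: String, uploadId: String): Unit =
        s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId))
    }
    ```

    Because completing an upload is a single POST per file, job commit stays cheap no matter how much data the tasks wrote.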

    You can pick up the committer this work is based on from Netflix and probably use it in Spark today. Do set the file commit algorithm to 1 (that should be the default) or it won't actually write the data; see the config sketch below.
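    A minimal sketch of that setting, assuming Spark 2.x and the DataFrame writer. The commented-out committer wiring is an assumption on my part; check the committer's own README for the exact property and class for your version.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-output-job")
      // The "algorithm = 1" setting the answer refers to: the classic
      // FileOutputCommitter algorithm, whose job-commit step the staging
      // committer builds on. Version 1 is the Hadoop default.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
      // Wiring in the Netflix committer itself is version-dependent; the
      // property and class below are assumptions -- verify against its docs.
      // .config("spark.sql.sources.outputCommitterClass",
      //         "com.netflix.bdp.s3.S3PartitionedOutputCommitter")
      .getOrCreate()

    // Writes now go through the configured committer (bucket is hypothetical).
    spark.range(1000).write.parquet("s3a://my-bucket/output/")
    ```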
