Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?

面向向阳花 2020-12-07 09:55

I'm running a Spark job in speculation mode. I have around 500 tasks and around 500 files of 1 GB each, gzip-compressed. In each job I keep getting, for 1-2 tasks, the attac

8 Answers
  •  眼角桃花
    2020-12-07 10:17

    I got the same issue on my 3-machine YARN cluster. I kept changing the RAM, but the issue persisted. Finally I saw the following messages in the logs:

    17/02/20 13:11:02 WARN spark.HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 1006275 ms exceeds timeout 1000000 ms
    17/02/20 13:11:02 ERROR cluster.YarnScheduler: Lost executor 2 on 1worker.com: Executor heartbeat timed out after 1006275 ms
    

    and after this, there was this message:

    org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 67
    

    I modified the properties in spark-defaults.conf as follows:

    spark.yarn.scheduler.heartbeat.interval-ms 7200000
    spark.executor.heartbeatInterval 7200000
    spark.network.timeout 7200000
    

    That's it! My job completed successfully after this.
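
    For reference, here is a minimal sketch (not from the original answer) of setting the same three timeouts programmatically through SparkConf instead of spark-defaults.conf. The app name and job body are placeholders, and the values simply mirror the answer above.

        import org.apache.spark.SparkConf
        import org.apache.spark.sql.SparkSession

        object TimeoutConfigExample {
          def main(args: Array[String]): Unit = {
            // Raise the heartbeat/network timeouts that led to the lost executor
            // and the "Missing an output location for shuffle" error above.
            val conf = new SparkConf()
              .setAppName("shuffle-timeout-example") // hypothetical app name
              .set("spark.yarn.scheduler.heartbeat.interval-ms", "7200000")
              .set("spark.executor.heartbeatInterval", "7200000")
              .set("spark.network.timeout", "7200000")

            val spark = SparkSession.builder().config(conf).getOrCreate()

            // ... run the shuffle-heavy job here ...

            spark.stop()
          }
        }

    The same values can also be passed on the command line with spark-submit --conf key=value if editing spark-defaults.conf is not an option.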
