Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?

前端未结

关注

 8  1313

面向向阳花 2020-12-07 09:55

I\'m running a Spark job with in a speculation mode. I have around 500 tasks and around 500 files of 1 GB gz compressed. I keep getting in each job, for 1-2 tasks, the attac

8条回答

眼角桃花 (楼主)

2020-12-07 10:17

I got the same issue on my 3 machine YARN cluster. I kept changing RAM but the issue persisted. Finally I saw the following messages in the logs:

17/02/20 13:11:02 WARN spark.HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 1006275 ms exceeds timeout 1000000 ms
17/02/20 13:11:02 ERROR cluster.YarnScheduler: Lost executor 2 on 1worker.com: Executor heartbeat timed out after 1006275 ms

and after this, there was this message:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 67

I modified the properties in spark-defaults.conf as follows:

spark.yarn.scheduler.heartbeat.interval-ms 7200000
spark.executor.heartbeatInterval 7200000
spark.network.timeout 7200000

That's it! My job completed successfully after this.

0 讨论(0)

查看其它8个回答