Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?

面向向阳花 2020-12-07 09:55

I'm running a Spark job in speculation mode. I have around 500 tasks and around 500 gz-compressed files of 1 GB each. In each job, for 1-2 tasks, I keep getting the error shown in the title.

8 answers
  • 2020-12-07 10:00

    In my case (standalone cluster) the exception was thrown because the file system of some Spark slaves was 100% full. Deleting everything in the spark/work folders on those slaves solved the issue.

  • 2020-12-07 10:01

    This happened to me when I gave the worker node more memory than it had. Since it had no swap, Spark crashed while trying to store objects for shuffling with no memory left.

    The solution was either to add swap, or to configure the worker/executor to use less memory, in addition to using the MEMORY_AND_DISK storage level for several persists (see the sketch below).
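    A minimal PySpark sketch of that change, assuming a simple key-count job (the app name and input/output paths are placeholders, not from the question):

    # Persist an intermediate RDD with MEMORY_AND_DISK so blocks that don't fit
    # in executor memory spill to disk instead of crashing the executor during
    # the shuffle.
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="memory-and-disk-persist")   # hypothetical app name

    lines = sc.textFile("hdfs:///data/input")               # hypothetical input path
    pairs = lines.map(lambda line: (line.split(",")[0], 1))

    # Persist before the wide operation; spills to disk when memory is tight.
    pairs.persist(StorageLevel.MEMORY_AND_DISK)

    counts = pairs.reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs:///data/output")             # hypothetical output path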

  • 2020-12-07 10:03

    I got the same problem, and many of the answers I searched could not solve it. Eventually I debugged my code step by step and found that the problem was caused by unbalanced data sizes across partitions, which led to the MetadataFetchFailedException in the map stage, not the reduce stage. Just do df_rdd.repartition(nums) before reduceByKey() (see the sketch below).
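    A minimal sketch of that fix, assuming a pair RDD; df_rdd and nums come from this answer, the rest (app name, input path, partition count) is illustrative:

    # Repartition before the wide reduceByKey so skewed partitions are
    # redistributed evenly across the cluster.
    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-before-reduce")   # hypothetical app name

    df_rdd = sc.textFile("hdfs:///data/skewed") \
               .map(lambda x: (x.split("\t")[0], 1))          # hypothetical input

    nums = 400                                 # assumed partition count; tune to your data
    balanced = df_rdd.repartition(nums)        # even out partition sizes
    result = balanced.reduceByKey(lambda a, b: a + b)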

  • 2020-12-07 10:07

    For me, I was doing some windowing on large data (about 50B rows) and getting a boatload of the following in my logs:

    ExternalAppendOnlyUnsafeRowArray:54 - Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter

    Obviously 4096 can be small for such a data size... this led me to the following JIRA:

    https://issues.apache.org/jira/browse/SPARK-21595

    And ultimately to the following two config options:

    • spark.sql.windowExec.buffer.spill.threshold
    • spark.sql.windowExec.buffer.in.memory.threshold

    Both default to 4096; I raised them much higher (2097152) and things now seem to work well. I'm not 100% sure this is the same issue as the one raised here, but it's another thing to try.
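    A sketch of setting those two thresholds at session creation, reusing the 2097152 value from above (the app name is a placeholder; tune the value to your executor memory):

    from pyspark.sql import SparkSession

    # Raise the window-exec buffer thresholds so windowing spills to disk less often.
    spark = (
        SparkSession.builder
        .appName("window-spill-threshold")     # hypothetical app name
        .config("spark.sql.windowExec.buffer.spill.threshold", 2097152)
        .config("spark.sql.windowExec.buffer.in.memory.threshold", 2097152)
        .getOrCreate()
    )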

  • 2020-12-07 10:14

    We had a similar error with Spark, but I'm not sure it's related to your issue.

    We used JavaPairRDD.repartitionAndSortWithinPartitions on 100GB of data and it kept failing similarly to your app. Then we looked at the YARN logs on the specific nodes and found an out-of-memory problem, so YARN interrupted the execution. Our solution was to change/add spark.shuffle.memoryFraction 0 in .../spark/conf/spark-defaults.conf. That allowed us to process a much larger (though unfortunately not unlimited) amount of data.
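    A sketch of making the same change programmatically instead of in spark-defaults.conf; note that spark.shuffle.memoryFraction belongs to Spark's legacy memory manager and is ignored by newer releases unless legacy memory mode is enabled (the app name is a placeholder):

    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("shuffle-memory-fraction")        # hypothetical app name
        .set("spark.shuffle.memoryFraction", "0")     # value used in this answer
    )
    sc = SparkContext(conf=conf)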

  • 2020-12-07 10:17

    I got the same issue on my 3-machine YARN cluster. I kept changing the RAM, but the issue persisted. Finally I saw the following messages in the logs:

    17/02/20 13:11:02 WARN spark.HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 1006275 ms exceeds timeout 1000000 ms
    17/02/20 13:11:02 ERROR cluster.YarnScheduler: Lost executor 2 on 1worker.com: Executor heartbeat timed out after 1006275 ms
    

    and after this, there was this message:

    org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 67
    

    I modified the properties in spark-defaults.conf as follows:

    spark.yarn.scheduler.heartbeat.interval-ms 7200000
    spark.executor.heartbeatInterval 7200000
    spark.network.timeout 7200000
    

    That's it! My job completed successfully after this.
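    A sketch of applying the same three settings programmatically rather than editing spark-defaults.conf, mirroring the values above (the app name is a placeholder):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("long-heartbeat-timeouts")     # hypothetical app name
        .config("spark.yarn.scheduler.heartbeat.interval-ms", "7200000")
        .config("spark.executor.heartbeatInterval", "7200000")
        .config("spark.network.timeout", "7200000")
        .getOrCreate()
    )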
