I'm running a Spark job with speculation mode enabled. I have around 500 tasks and around 500 files of 1 GB each, gz compressed. In each job, for 1-2 tasks, I keep getting the attached error (MetadataFetchFailedException: Missing an output location for shuffle).
In my case (standalone cluster) the exception was thrown because the file system of some Spark slaves was 100% full. Deleting everything in the spark/work folders of those slaves solved the issue.
This happened to me when I gave more memory to the worker node than it had. Since it didn't have swap, Spark crashed while trying to store objects for shuffling with no more memory left.
The solution was either to add swap, or to configure the worker/executor to use less memory, in addition to using the MEMORY_AND_DISK storage level for several persists.
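A minimal Scala sketch of that storage level (the app name and input path are placeholders, not from the original setup):
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
// Persist an intermediate dataset with MEMORY_AND_DISK so partitions that do not
// fit in executor memory spill to local disk instead of crashing the executor.
val spark = SparkSession.builder().appName("memory-and-disk-persist").getOrCreate()
val df = spark.read.parquet("/path/to/input")          // placeholder input path
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()                                         // materialize the persisted data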
I got the same problem, and none of the answers I found solved it. Eventually I debugged my code step by step and found that the cause was data that was not balanced across the partitions, which led to the MetadataFetchFailedException in the map stage, not the reduce stage. Just do df_rdd.repartition(nums) before reduceByKey().
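A minimal Scala sketch of that fix (the input path, key extraction, and partition count are placeholder assumptions):
import org.apache.spark.sql.SparkSession
// Repartition to even out skewed partitions before the shuffle-heavy reduceByKey,
// so no single map task has to process an oversized chunk of data.
val spark = SparkSession.builder().appName("repartition-before-reduce").getOrCreate()
val sc = spark.sparkContext
val pairs = sc.textFile("/path/to/input")              // placeholder input
  .map(line => (line.split(",")(0), 1L))               // key by the first column
val counts = pairs
  .repartition(200)                                    // nums: tune to your data and cluster
  .reduceByKey(_ + _)
counts.take(10).foreach(println)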
For me, I was doing some windowing on large data (about 50B rows) and getting a boatload of
ExternalAppendOnlyUnsafeRowArray:54 - Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
in my logs. Obviously 4096 can be small for such a data size... This led me to the following JIRA:
https://issues.apache.org/jira/browse/SPARK-21595
And ultimately to the following two config options:
spark.sql.windowExec.buffer.spill.threshold
spark.sql.windowExec.buffer.in.memory.threshold
Both default to 4096; I raised them much higher (2097152) and things now seem to do well. I'm not 100% sure this is the same as the issue raised here, but it's another thing to try.
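In case it helps, a sketch of how those two options can be raised on the session in Scala (2097152 is simply the value used above, not a recommendation):
import org.apache.spark.sql.SparkSession
// Raise the window-exec buffer thresholds so the in-memory row array holds more
// rows before spilling to the UnsafeExternalSorter.
val spark = SparkSession.builder()
  .appName("window-spill-thresholds")
  .config("spark.sql.windowExec.buffer.in.memory.threshold", "2097152")
  .config("spark.sql.windowExec.buffer.spill.threshold", "2097152")
  .getOrCreate()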
We had a similar error with Spark, but I'm not sure it's related to your issue.
We used JavaPairRDD.repartitionAndSortWithinPartitions
on 100GB data and it kept failing similarly to your app. Then we looked at the YARN logs on the specific nodes and found out that we had some kind of out-of-memory problem, so YARN interrupted the execution. Our solution was to change/add spark.shuffle.memoryFraction 0
in .../spark/conf/spark-defaults.conf. That allowed us to handle a much larger (but unfortunately not infinite) amount of data this way.
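For reference, a sketch of setting the same property programmatically in Scala rather than in spark-defaults.conf (note that spark.shuffle.memoryFraction is a legacy option; from Spark 1.6 on it only applies when spark.memory.useLegacyMode is enabled):
import org.apache.spark.{SparkConf, SparkContext}
// Equivalent of adding "spark.shuffle.memoryFraction 0" to spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("shuffle-memory-fraction")
  .set("spark.shuffle.memoryFraction", "0")
val sc = new SparkContext(conf)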
I got the same issue on my 3-machine YARN cluster. I kept changing RAM but the issue persisted. Finally I saw the following messages in the logs:
17/02/20 13:11:02 WARN spark.HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 1006275 ms exceeds timeout 1000000 ms
17/02/20 13:11:02 ERROR cluster.YarnScheduler: Lost executor 2 on 1worker.com: Executor heartbeat timed out after 1006275 ms
and after this, there was this message:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 67
I modified the properties in spark-defaults.conf as follows:
spark.yarn.scheduler.heartbeat.interval-ms 7200000
spark.executor.heartbeatInterval 7200000
spark.network.timeout 7200000
That's it! My job completed successfully after this.
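A sketch of those same three properties set on the session builder in Scala instead of spark-defaults.conf (values are the ones above; the Spark docs recommend keeping spark.executor.heartbeatInterval well below spark.network.timeout):
import org.apache.spark.sql.SparkSession
// Long heartbeat and network timeouts so slow executors are not dropped mid-shuffle.
val spark = SparkSession.builder()
  .appName("heartbeat-timeouts")
  .config("spark.yarn.scheduler.heartbeat.interval-ms", "7200000")
  .config("spark.executor.heartbeatInterval", "7200000")
  .config("spark.network.timeout", "7200000")
  .getOrCreate()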