Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?

面向向阳花 2020-12-07 09:55

I'm running a Spark job in speculation mode. I have around 500 tasks and around 500 gz-compressed files of 1 GB each. In each job, for 1-2 tasks, I keep getting the error shown in the title.

8 answers
  • 2020-12-07 10:00

    In my case (standalone cluster) the exception was thrown because the file system of some Spark slaves was 100% full. Deleting everything in the spark/work folders on those slaves solved the issue.

  • 2020-12-07 10:01

    This happened to me when I gave the worker node more memory than it had. Since it had no swap, Spark crashed while trying to store objects for shuffling with no memory left.

    The solution was either to add swap, or to configure the worker/executor to use less memory, in addition to using the MEMORY_AND_DISK storage level for several persists (see the sketch below).
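    A minimal PySpark sketch of that change, assuming a simple key-count job (the app name and input/output paths are placeholders, not from the question):

    # Persist an intermediate RDD with MEMORY_AND_DISK so blocks that don't fit
    # in executor memory spill to disk instead of crashing the executor during
    # the shuffle.
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="memory-and-disk-persist")   # hypothetical app name

    lines = sc.textFile("hdfs:///data/input")               # hypothetical input path
    pairs = lines.map(lambda line: (line.split(",")[0], 1))

    # Persist before the wide operation; spills to disk when memory is tight.
    pairs.persist(StorageLevel.MEMORY_AND_DISK)

    counts = pairs.reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs:///data/output")             # hypothetical output path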

  • 2020-12-07 10:03

    I got the same problem, and many of the answers I searched could not solve it. Eventually I debugged my code step by step and found that the problem was caused by unbalanced data sizes across partitions, which led to the MetadataFetchFailedException in the map stage, not the reduce stage. Just do df_rdd.repartition(nums) before reduceByKey() (see the sketch below).
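    A minimal sketch of that fix, assuming a pair RDD; df_rdd and nums come from this answer, the rest (app name, input path, partition count) is illustrative:

    # Repartition before the wide reduceByKey so skewed partitions are
    # redistributed evenly across the cluster.
    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-before-reduce")   # hypothetical app name

    df_rdd = sc.textFile("hdfs:///data/skewed") \
               .map(lambda x: (x.split("\t")[0], 1))          # hypothetical input

    nums = 400                                 # assumed partition count; tune to your data
    balanced = df_rdd.repartition(nums)        # even out partition sizes
    result = balanced.reduceByKey(lambda a, b: a + b)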

  • 2020-12-07 10:07

    For me, I was doing some windowing on large data (about 50B rows) and getting a boatload of the following in my logs:

    ExternalAppendOnlyUnsafeRowArray:54 - Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter

    Obviously 4096 can be small for such a data size... this led me to the following JIRA:

    https://issues.apache.org/jira/browse/SPARK-21595

    And ultimately to the following two config options:

    • spark.sql.windowExec.buffer.spill.threshold
    • spark.sql.windowExec.buffer.in.memory.threshold

    Both default to 4096; I raised them much higher (2097152) and things now seem to work well. I'm not 100% sure this is the same issue as the one raised here, but it's another thing to try.
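    A sketch of setting those two thresholds at session creation, reusing the 2097152 value from above (the app name is a placeholder; tune the value to your executor memory):

    from pyspark.sql import SparkSession

    # Raise the window-exec buffer thresholds so windowing spills to disk less often.
    spark = (
        SparkSession.builder
        .appName("window-spill-threshold")     # hypothetical app name
        .config("spark.sql.windowExec.buffer.spill.threshold", 2097152)
        .config("spark.sql.windowExec.buffer.in.memory.threshold", 2097152)
        .getOrCreate()
    )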

  • 2020-12-07 10:14

    We had a similar error with Spark, but I'm not sure it's related to your issue.

    We used JavaPairRDD.repartitionAndSortWithinPartitions on 100GB of data and it kept failing similarly to your app. Then we looked at the YARN logs on the specific nodes and found an out-of-memory problem, so YARN interrupted the execution. Our solution was to change/add spark.shuffle.memoryFraction 0 in .../spark/conf/spark-defaults.conf. That allowed us to process a much larger (though unfortunately not unlimited) amount of data.
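    A sketch of making the same change programmatically instead of in spark-defaults.conf; note that spark.shuffle.memoryFraction belongs to Spark's legacy memory manager and is ignored by newer releases unless legacy memory mode is enabled (the app name is a placeholder):

    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("shuffle-memory-fraction")        # hypothetical app name
        .set("spark.shuffle.memoryFraction", "0")     # value used in this answer
    )
    sc = SparkContext(conf=conf)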

  • 2020-12-07 10:17

    I got the same issue on my 3-machine YARN cluster. I kept changing the RAM, but the issue persisted. Finally I saw the following messages in the logs:

    17/02/20 13:11:02 WARN spark.HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 1006275 ms exceeds timeout 1000000 ms
    17/02/20 13:11:02 ERROR cluster.YarnScheduler: Lost executor 2 on 1worker.com: Executor heartbeat timed out after 1006275 ms
    

    and after this, there was this message:

    org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 67
    

    I modified the properties in spark-defaults.conf as follows:

    spark.yarn.scheduler.heartbeat.interval-ms 7200000
    spark.executor.heartbeatInterval 7200000
    spark.network.timeout 7200000
    

    That's it! My job completed successfully after this.
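    A sketch of applying the same three settings programmatically rather than editing spark-defaults.conf, mirroring the values above (the app name is a placeholder):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("long-heartbeat-timeouts")     # hypothetical app name
        .config("spark.yarn.scheduler.heartbeat.interval-ms", "7200000")
        .config("spark.executor.heartbeatInterval", "7200000")
        .config("spark.network.timeout", "7200000")
        .getOrCreate()
    )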
