I'm running a Spark job in speculation mode. I have around 500 tasks and around 500 files of 1 GB each, gz compressed. In each job, for 1-2 tasks, I keep getting the attac
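For context, the job setup looks roughly like the sketch below; the app name and input path are placeholders, and speculation is enabled through the standard spark.speculation property (it could just as well be passed with --conf to spark-submit):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SpeculativeGzJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("speculative-gz-job")   // placeholder name
                // Speculation re-launches straggling task attempts on other executors.
                .set("spark.speculation", "true");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // gzip files are not splittable, so ~500 .gz files produce ~500 tasks.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input/*.gz"); // placeholder path
            System.out.println("line count: " + lines.count());
        }
    }
}
```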
We had a similar error with Spark, but I'm not sure it's related to your issue.
We used JavaPairRDD.repartitionAndSortWithinPartitions on 100 GB of data and it kept failing similarly to your app. Then we looked at the YARN logs on the specific nodes and found that we had some kind of out-of-memory problem, so YARN interrupted the execution. Our solution was to change/add spark.shuffle.memoryFraction 0 in .../spark/conf/spark-defaults.conf. That allowed us to handle a much larger (but unfortunately not infinite) amount of data.
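For reference, here is a rough sketch of the kind of job and setting described above; the input path, the tab-delimited parsing, and the partition count of 500 are just placeholders, and the property can equally be set programmatically instead of in spark-defaults.conf (note that spark.shuffle.memoryFraction is a legacy setting; since Spark 1.6 it only takes effect with spark.memory.useLegacyMode=true):

```java
import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RepartitionAndSortExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("repartition-and-sort-example")   // placeholder name
                // Same effect as the spark-defaults.conf line above: give the shuffle
                // no dedicated in-memory buffer so it spills to disk early.
                .set("spark.shuffle.memoryFraction", "0");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Placeholder input: tab-separated key/value lines.
            JavaPairRDD<String, String> pairs = sc
                    .textFile("hdfs:///data/input")
                    .mapToPair(line -> {
                        String[] parts = line.split("\t", 2);
                        return new Tuple2<String, String>(parts[0], parts.length > 1 ? parts[1] : "");
                    });

            // Repartition and sort by key within each partition in a single shuffle,
            // instead of a repartition followed by a separate sort.
            JavaPairRDD<String, String> sorted =
                    pairs.repartitionAndSortWithinPartitions(new HashPartitioner(500));

            sorted.saveAsTextFile("hdfs:///data/output");
        }
    }
}
```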