Spark job restarted after showing all jobs completed and then fails (TimeoutException: Futures timed out after [300 seconds])


Question


I'm running a Spark job. The UI shows that all of the jobs completed:

However, after a couple of minutes the entire job restarts. This time it again shows that all jobs and tasks completed, but after a couple more minutes it fails. I found this exception in the logs:

java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]

This happens when I'm trying to join two pretty big tables: one of 3B rows and the second of 200M rows. When I run show(100) on the resulting dataframe, everything gets evaluated and I run into this issue.
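For context, here is a minimal sketch of the kind of join described above (Spark 1.4-era DataFrame API; the table names, column name "id", and app name are placeholders, not taken from my actual code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("joinExample"))
val sqlContext = new SQLContext(sc)

val bigDf   = sqlContext.table("big_table")    // ~3B rows (placeholder name)
val smallDf = sqlContext.table("small_table")  // ~200M rows (placeholder name)

val joined = bigDf.join(smallDf, bigDf("id") === smallDf("id"))

// Nothing is evaluated until an action runs; show(100) forces the whole plan,
// which is where the timeout surfaces.
joined.show(100)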

I tried playing around with increasing/decreasing the number of partitions, and I changed the garbage collector to G1 with an increased number of threads. I also changed spark.sql.broadcastTimeout to 600, which only made the timeout message change to 600 seconds.
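For reference, a rough sketch of how those settings can be applied on the SQLContext in the Spark 1.4-era API (the partition value of 400 is just an example, not the number I actually used):

sqlContext.setConf("spark.sql.broadcastTimeout", "600")    // raises the 300s default seen in the error
sqlContext.setConf("spark.sql.shuffle.partitions", "400")  // example value when experimenting with partition counts

The same keys can also be passed on the spark-submit command line via --conf.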

I also read that this might be a communication issue; however, other show() calls that run prior to this code segment work without problems, so that's probably not it.

This is the submit command:

/opt/spark/spark-1.4.1-bin-hadoop2.3/bin/spark-submit  --master yarn-cluster --class className --executor-memory 12g --executor-cores 2 --driver-memory 32g --driver-cores 8 --num-executors 40 --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:ConcGCThreads=20" /home/asdf/fileName-assembly-1.0.jar

You can get an idea of the Spark version and the resources used from there.

Where do I go from here? Any help will be appreciated, and code segments/additional logging will be provided if needed.


Answer 1:


What eventually solved this was persisting both data frames before the join.

I looked at the execution plan before and after persisting the data frames, and the strange thing was that before persisting, Spark tried to perform a BroadcastHashJoin, which clearly failed due to the large size of the data frames. After persisting, the execution plan showed that the join would be a ShuffleHashJoin, which completed without any issues whatsoever. A bug? Maybe; I'll try a newer Spark version when I get to it.
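As a rough illustration of what the fix looks like (my sketch, not the exact code from the job; names are placeholders and persist() uses the default storage level):

val bigDf   = sqlContext.table("big_table").persist()    // placeholder names; persist both sides
val smallDf = sqlContext.table("small_table").persist()  // before the join is planned

val joined = bigDf.join(smallDf, bigDf("id") === smallDf("id"))
joined.show(100)  // after persisting, the plan showed a ShuffleHashJoin instead of a BroadcastHashJoin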



Source: https://stackoverflow.com/questions/36290486/spark-job-restarted-after-showing-all-jobs-completed-and-then-fails-timeoutexce
