I'm running a Spark job. The UI shows that all of the jobs completed, but after a couple of minutes the entire job restarts and the UI again shows all of the jobs running.
What eventually solved this was persisting both DataFrames before the join, roughly as in the sketch below.
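A minimal sketch of the fix, assuming two large DataFrames read from hypothetical parquet paths and joined on an assumed "id" column (the paths, names, and key are illustrative, not from my actual job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-before-join").getOrCreate()

val left  = spark.read.parquet("/data/left")   // hypothetical path
val right = spark.read.parquet("/data/right")  // hypothetical path

// Persist both sides before joining. Materializing them (e.g. via count())
// makes Spark compute actual sizes instead of relying on estimates.
val leftP  = left.persist(StorageLevel.MEMORY_AND_DISK)
val rightP = right.persist(StorageLevel.MEMORY_AND_DISK)
leftP.count()
rightP.count()

val joined = leftP.join(rightP, Seq("id"))  // "id" is an assumed join key
```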
I looked at the execution plan before and after persisting the DataFrames. The strange thing was that before persisting, Spark tried to perform a BroadcastHashJoin, which clearly failed due to the large size of the DataFrame; after persisting, the execution plan showed the join would be a ShuffleHashJoin, and it completed without any issues whatsoever. A bug? Maybe. I'll try a newer Spark version when I get to it.
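For anyone who wants to check which strategy Spark picks, explain() prints the physical plan, and automatic broadcast joins can be disabled outright as a workaround when Spark's size estimate wrongly qualifies a large DataFrame for broadcasting. A sketch, reusing the `joined` DataFrame and `spark` session from above:

```scala
// Print the physical plan; the join node will read BroadcastHashJoin,
// ShuffledHashJoin, SortMergeJoin, etc.
joined.explain()

// Setting the broadcast threshold to -1 disables automatic broadcast joins,
// forcing Spark to fall back to a shuffle-based strategy.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```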