Experts, I am noticing one peculiar thing with one of the PySpark jobs in production (running in YARN cluster mode). After executing for around an hour plus (around 65-75 mins),
Are you breaking the lineage? If not, the issue might be a long lineage. Try breaking the lineage somewhere in the middle of the code and see if that helps.
# Spark 1.6 code
# Note: in YARN cluster mode the checkpoint directory should live on HDFS;
# '.' is shown here only as a placeholder.
sc.setCheckpointDir('.')
# df is the original DataFrame you are performing transformations on
dfrdd = df.rdd
# Mark the RDD for checkpointing; its lineage is truncated once it is materialized
dfrdd.checkpoint()
df = sqlContext.createDataFrame(dfrdd)
# An action is needed to actually trigger the checkpoint
print df.count()
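If you ever move to Spark 2.1+, the round trip through the RDD API is no longer needed, since DataFrames have a checkpoint() method of their own. A minimal sketch, assuming a SparkSession named spark and a hypothetical HDFS checkpoint path:

# Spark 2.1+ sketch (assumes a SparkSession named `spark`; the path is hypothetical)
spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoints')
df = df.checkpoint(eager=True)  # materializes df and truncates its lineage
print(df.count())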
Let me know if it helps.