In Spark, each time we perform an action on an RDD, the RDD (and its whole lineage) is recomputed. So if we know that an RDD is going to be reused, we should cache it explicitly.
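A minimal sketch of the difference, assuming a local SparkSession and a placeholder input file `data.txt` (the object name and path are illustrative, not from the original):

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("cache-sketch").master("local[*]")
      .getOrCreate().sparkContext

    // A lineage with a non-trivial transformation.
    val lengths = sc.textFile("data.txt")   // placeholder input path
      .map(_.length)

    // Without this call, both actions below would re-read and re-map the
    // file; with it, the first action materializes the partitions in
    // memory and the second one reuses them.
    lengths.cache()

    println(lengths.count())   // triggers the computation, fills the cache
    println(lengths.sum())     // reuses the cached partitions
  }
}
```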
Let's look at why Spark doesn't do this automatically.
A subjective list of reasons:
- it is impossible to cache automatically without making assumptions about application semantics. In particular, Spark cannot know which results will be reused, which storage level (memory only, memory and disk, serialized, replicated) is appropriate, or when it is safe to evict the cached blocks; the sketch after this list makes these choices explicit.
It is also worth noting that cache() and persist() are lazy: nothing is materialized until the first action runs. Cached blocks can also be evicted under memory pressure, in which case the affected partitions are recomputed from the lineage, so caching is an optimization rather than a guarantee.
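To make this concrete, here is a hedged sketch (the object name, input path, and parsed column are assumptions for illustration) in which the caller makes every decision an automatic policy would otherwise have to guess: the storage level, when to materialize, and when to release the blocks:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistChoices {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("persist-choices").master("local[*]")
      .getOrCreate().sparkContext

    val scores = sc.textFile("scores.csv")   // placeholder input path
      .map(_.split(",")(1).toDouble)         // assumed: score in column 2

    // The caller, not Spark, picks the storage level: here, serialize and
    // spill to disk rather than risk eviction of large in-memory blocks.
    scores.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // persist() is lazy: the level is recorded, but nothing is stored yet.
    println(scores.getStorageLevel)   // reports the chosen level
    println(scores.max())             // first action fills the cache
    println(scores.min())             // served from the cached blocks

    // The caller also decides when the cached data is no longer needed.
    scores.unpersist()
  }
}
```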