I would like to understand on which node (driver or worker/executor) the data cached by the code below is stored:
df.cache() //df is a large dataframe (200GB)
And
df.cache() calls the persist() method, which by default stores the DataFrame at the MEMORY_AND_DISK storage level. The cached partitions live in the memory and local disk of the executors (worker nodes), not on the driver. With persist() you can change the storage level.
The persist() method calls
sparkSession.sharedState.cacheManager.cacheQuery()
and if you look at the code for cacheTable, it also calls the same
sparkSession.sharedState.cacheManager.cacheQuery()
That means both are equivalent and lazily evaluated (the data is only materialized once an action is performed), except that persist() lets you choose the storage level. The available levels include MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, and their serialized (_SER) and replicated (_2) variants.
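As a quick sketch (assuming a live SparkSession `spark` and a DataFrame `df`; the view name is illustrative), the equivalent calls look like this:

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK)
df.cache()

// persist() lets you pick the storage level explicitly,
// e.g. serialized, memory only:
df.persist(StorageLevel.MEMORY_ONLY_SER)

// cacheTable goes through the same CacheManager under the hood
df.createOrReplaceTempView("my_table")
spark.catalog.cacheTable("my_table")

// Nothing is materialized until an action runs:
df.count()

// Free the cached blocks on the executors when done
df.unpersist()
```

All three paths end up in the same CacheManager, so the cached plan is shared; only the storage level differs.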
You can also use SQL's CACHE TABLE, which is eagerly evaluated by default: the whole table is scanned and cached immediately, which may also lead to OOM for large tables. (There is a CACHE LAZY TABLE variant if you want lazy behavior.)
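For example (the table name is illustrative; run these via spark.sql(...) or the SQL shell):

```sql
-- Eager (default): the table is scanned and cached immediately
CACHE TABLE my_table;

-- Lazy variant: cached only when the table is first queried
CACHE LAZY TABLE my_table;

-- Release the cached data
UNCACHE TABLE my_table;
```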
Summary: cache(), persist(), and cacheTable() are lazily evaluated and need an action to materialize the cache, whereas SQL CACHE TABLE is eager by default.
See here for details!
You can choose as per your requirement!
Hope this helps!