I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DataFrames for faster execution:
df1.cache()
df2.cache()
To drop them from the cache once they are no longer needed, just do the following:
df1.unpersist()
df2.unpersist()
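As a side note, unpersist() in the Python API also takes a blocking flag if you want the call to wait until the cached blocks are actually removed (the default value of blocking has changed between Spark versions), a minimal sketch:

df1.unpersist(blocking=True)  # block until the cached data has actually been freed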
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
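Putting it together, a minimal sketch of the whole lifecycle, assuming df1 is any DataFrame you have already built (note that caching is lazy, and that the is_cached attribute is, to my knowledge, available on PySpark DataFrames):

df1.cache()           # lazy: only marks df1 for caching, nothing is stored yet
df1.count()           # the first action materializes the cached partitions
print(df1.is_cached)  # True: df1 is marked as cached
df1.unpersist()       # manually free the memory instead of waiting for LRU eviction
print(df1.is_cached)  # False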
If the dataframe is registered as a table for SQL operations, like
df.createGlobalTempView(tableName)  # or some other way, depending on the Spark version
then the cache can be dropped with the following commands (of course, Spark also does this automatically).
For Spark 2.x, where spark is a SparkSession object:
Drop a specific table/df from the cache:
spark.catalog.uncacheTable(tableName)
Drop all tables/dfs from the cache:
spark.catalog.clearCache()
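For example, a minimal sketch of this path, using a hypothetical table name my_table (the SQL query is only there to force materialization):

df.createOrReplaceTempView("my_table")             # register df as a temp view
spark.catalog.cacheTable("my_table")               # mark the table as cached (lazy)
spark.sql("SELECT COUNT(*) FROM my_table").show()  # an action materializes the cache
print(spark.catalog.isCached("my_table"))          # True while the table is cached
spark.catalog.uncacheTable("my_table")             # drop only this table from the cache
spark.catalog.clearCache()                         # or drop everything that is cached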
For Spark 1.x, where you work with an SQLContext instead:
Drop a specific table/df from the cache:
sqlContext.uncacheTable(tableName)
Drop all tables/dfs from the cache:
sqlContext.clearCache()
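And the equivalent sketch for the SQLContext path, assuming sqlContext already exists (as it does in the PySpark shell) and reusing the hypothetical table name:

df.registerTempTable("my_table")                        # 1.x way to register df for SQL
sqlContext.cacheTable("my_table")                       # mark the table as cached (lazy)
sqlContext.sql("SELECT COUNT(*) FROM my_table").show()  # an action materializes the cache
sqlContext.uncacheTable("my_table")                     # drop only this table from the cache
sqlContext.clearCache()                                 # or drop everything that is cached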