Considering memory is limited, I had a feeling that Spark automatically removes RDDs from each node. I'd like to know: is this time configurable? How does Spark decide when to remove an RDD from memory?
In general, it's just as Yuval Itzchakov wrote: "just like any other object". But... (there's always a "but", isn't there?)
In Spark, it's not that obvious, since we also have shuffle blocks (among the other blocks managed by Spark). They are managed by BlockManagers running on the executors, which somehow have to be notified when an object on the driver gets evicted from memory, right?
That's where ContextCleaner comes on stage. It is the Spark application's garbage collector, responsible for application-wide cleanup of shuffles, RDDs, broadcasts, accumulators and checkpointed RDDs, and it aims at reducing the memory requirements of long-running, data-heavy Spark applications.
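To make this concrete, here is a minimal Scala sketch (my own illustration, assuming a local[2] master and default cleaner settings, not code taken from Spark itself) that caches an RDD, drops the last driver-side reference, and lets ContextCleaner unpersist it after a JVM GC:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ContextCleanerDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("context-cleaner-demo")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Cache an RDD and materialize its blocks.
    var rdd = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
    rdd.count()
    println(sc.getPersistentRDDs.size)   // 1 -- the cached RDD is tracked by the driver

    // Drop the last strong reference and encourage a GC so the weak reference
    // registered by ContextCleaner gets enqueued.
    rdd = null
    System.gc()
    Thread.sleep(5000)                   // give the cleaner thread time to run

    println(sc.getPersistentRDDs.size)   // typically 0 once ContextCleaner has unpersisted it
    spark.stop()
  }
}

The timing is not deterministic: cleanup only happens after the JVM actually collects the RDD object, which is why the explicit System.gc() and the sleep are there.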
ContextCleaner runs on the driver. It is created and immediately started when SparkContext starts (and the spark.cleaner.referenceTracking Spark property is enabled, which it is by default). It is stopped when SparkContext is stopped.
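Its behaviour is driven by a handful of Spark properties. A small configuration sketch (the values shown match their documented defaults, so setting them explicitly is purely illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("cleaner-config-sketch")
  .setMaster("local[2]")
  // Master switch: when false, no ContextCleaner is created at all.
  .set("spark.cleaner.referenceTracking", "true")
  // Whether non-shuffle cleanup calls block the cleaning thread.
  .set("spark.cleaner.referenceTracking.blocking", "true")
  // Whether shuffle cleanup blocks the cleaning thread as well.
  .set("spark.cleaner.referenceTracking.blocking.shuffle", "false")
  // How often the driver triggers a JVM GC so weakly referenced state gets noticed.
  .set("spark.cleaner.periodicGC.interval", "30min")

val spark = SparkSession.builder().config(conf).getOrCreate()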
You can see it working by taking a dump of all the threads in a Spark application, using jconsole or jstack. ContextCleaner uses a daemon "Spark Context Cleaner" thread that cleans RDD, shuffle, and broadcast states.
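If you would rather check from inside the driver than with an external tool, here is a tiny convenience snippet of my own (e.g. to paste into spark-shell once the SparkContext is up) that scans the JVM threads for the cleaner thread by name:

// Look for the ContextCleaner's daemon thread among all JVM threads on the driver.
Thread.getAllStackTraces.keySet.forEach { t =>
  if (t.getName.contains("Spark Context Cleaner"))
    println(s"${t.getName} (daemon=${t.isDaemon}, state=${t.getState})")
}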
You can also see its work by enabling the INFO or DEBUG logging level for the org.apache.spark.ContextCleaner logger. Just add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.ContextCleaner=DEBUG