Considering memory is limited, I had a feeling that Spark automatically removes RDDs from each node. I'd like to know: is this time configurable? How does Spark decide when to remove an RDD from memory?
In general, it's just as Yuval Itzchakov wrote: "just like any other object". But... (there's always a "but", isn't there?)
In Spark, it's not that obvious, since we also have shuffle blocks (among the other blocks managed by Spark). They are managed by BlockManagers running on the executors, which somehow have to be notified when an object on the driver gets evicted from memory, right?
That's where ContextCleaner comes on stage. It is the Spark application's garbage collector, responsible for application-wide cleanup of shuffles, RDDs, broadcasts, accumulators and checkpointed RDDs, and it aims at reducing the memory requirements of long-running, data-heavy Spark applications.
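To make this concrete, here is a minimal Scala sketch (my own illustration, assuming a local[2] master and default cleaner settings, not code taken from Spark itself) that caches an RDD, drops the last driver-side reference, and lets ContextCleaner unpersist it after a JVM GC:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ContextCleanerDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("context-cleaner-demo")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Cache an RDD and materialize its blocks.
    var rdd = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
    rdd.count()
    println(sc.getPersistentRDDs.size)   // 1 -- the cached RDD is tracked by the driver

    // Drop the last strong reference and encourage a GC so the weak reference
    // registered by ContextCleaner gets enqueued.
    rdd = null
    System.gc()
    Thread.sleep(5000)                   // give the cleaner thread time to run

    println(sc.getPersistentRDDs.size)   // typically 0 once ContextCleaner has unpersisted it
    spark.stop()
  }
}

The timing is not deterministic: cleanup only happens after the JVM actually collects the RDD object, which is why the explicit System.gc() and the sleep are there.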
ContextCleaner runs on the driver. It is created and immediately started when SparkContext starts (and the spark.cleaner.referenceTracking Spark property is enabled, which it is by default). It is stopped when SparkContext is stopped.
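Its behaviour is driven by a handful of Spark properties. A small configuration sketch (the values shown match their documented defaults, so setting them explicitly is purely illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("cleaner-config-sketch")
  .setMaster("local[2]")
  // Master switch: when false, no ContextCleaner is created at all.
  .set("spark.cleaner.referenceTracking", "true")
  // Whether non-shuffle cleanup calls block the cleaning thread.
  .set("spark.cleaner.referenceTracking.blocking", "true")
  // Whether shuffle cleanup blocks the cleaning thread as well.
  .set("spark.cleaner.referenceTracking.blocking.shuffle", "false")
  // How often the driver triggers a JVM GC so weakly referenced state gets noticed.
  .set("spark.cleaner.periodicGC.interval", "30min")

val spark = SparkSession.builder().config(conf).getOrCreate()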
You can see it working by taking a dump of all the threads in a Spark application, using jconsole or jstack. ContextCleaner uses a daemon "Spark Context Cleaner" thread that cleans RDD, shuffle, and broadcast states.
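If you would rather check from inside the driver than with an external tool, here is a tiny convenience snippet of my own (e.g. to paste into spark-shell once the SparkContext is up) that scans the JVM threads for the cleaner thread by name:

// Look for the ContextCleaner's daemon thread among all JVM threads on the driver.
Thread.getAllStackTraces.keySet.forEach { t =>
  if (t.getName.contains("Spark Context Cleaner"))
    println(s"${t.getName} (daemon=${t.isDaemon}, state=${t.getState})")
}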
You can also see its work by enabling the INFO or DEBUG logging level for the org.apache.spark.ContextCleaner logger. Just add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.ContextCleaner=DEBUG