Would Spark unpersist the RDD itself when it realizes it won't be used anymore?

Yes, Apache Spark will unpersist the RDD when it's garbage collected.

In RDD.persist you can see:

sc.cleaner.foreach(_.registerRDDForCleanup(this))

This puts a WeakReference to the RDD in a ReferenceQueue leading to ContextCleaner.doCleanupRDD when the RDD is garbage collected. And there:

sc.unpersistRDD(rddId, blocking)

For more context see ContextCleaner in general and the commit that added it.
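As a rough illustration, here is a sketch of how you could observe this behaviour in a spark-shell session. The variable name is illustrative, System.gc() is only a hint to the JVM, and the shell itself can hold extra references to the RDD, so the cleanup may be delayed or may not happen at all:

    // Run in spark-shell, where `sc` is the SparkContext.
    var rdd = sc.parallelize(1 to 1000000).persist()
    rdd.count()                         // materialize the cached partitions
    println(sc.getPersistentRDDs.size)  // typically 1: the RDD is registered as persistent

    rdd = null                          // drop the strong reference on the driver
    System.gc()                         // hint the driver JVM to collect; ContextCleaner
                                        // then calls sc.unpersistRDD for the dead RDD
    Thread.sleep(5000)                  // the cleaner works asynchronously
    println(sc.getPersistentRDDs.size)  // usually back to 0 once the cleanup has run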

A few things to be aware of when relying on garbage collection for unpersisting RDDs:

  • The RDD's cached blocks consume memory and disk on the executors, but the garbage collection that triggers the cleanup happens on the driver. The RDD will not be automatically unpersisted until there is enough memory pressure on the driver, no matter how full the executors' memory or disk gets.
  • You cannot unpersist part of an RDD (some partitions/records). If you build one persisted RDD from another, both will have to fit entirely on the executors at the same time, unless you unpersist the parent explicitly; see the sketch after this list.
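The usual way around both points is not to rely on the driver-side garbage collector at all, but to call unpersist yourself once a derived RDD has been materialized. A minimal sketch, with illustrative names and an illustrative input path:

    // Explicitly release the parent's cached blocks once the child is computed,
    // instead of waiting for driver-side garbage collection.
    val rawData  = sc.textFile("hdfs:///data/input").persist()   // illustrative path
    val features = rawData.map(_.split(",").map(_.toDouble)).persist()

    features.count()     // force computation, so `features` no longer needs `rawData`
    rawData.unpersist()  // free the parent's cached partitions on the executors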

As pointed out by @Daniel, Spark will remove partitions from the cache. This will happen once there is no more memory available, and will be done using a least-recently-used algorithm. It is not a smart system, as pointed out by @eliasah.
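If you want some control over what happens to partitions when that eviction kicks in, the storage level is the knob to turn. A small sketch (the RDDs here are purely illustrative): with MEMORY_ONLY an evicted partition is recomputed from its lineage on next use, while MEMORY_AND_DISK spills it to local disk instead:

    import org.apache.spark.storage.StorageLevel

    // Evicted partitions of this RDD are recomputed from the lineage when needed again.
    val inMemory  = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)

    // Evicted partitions of this RDD are written to local disk and read back instead.
    val spillable = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)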

If you are not caching many RDDs, you don't have to worry about any of this. If you cache a lot of them, JVM garbage-collection pauses can become excessive, so in that case it is a good idea to unpersist them explicitly once you are done with them.
