Spark: clean up shuffle data spilled to disk


Once you have cached an RDD in memory or on disk, it stays there for as long as the SparkContext is alive.

To tell the driver that it can remove the RDD from memory/disk, you need to call the unpersist() method.

From the Scaladoc:

/**
 * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
 *
 * @param blocking Whether to block until all blocks are deleted.
 * @return This RDD.
 */
def unpersist(blocking: Boolean = true)

So you can use:

rdd.unpersist()
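
For instance, here is a minimal sketch of that lifecycle (the app name, local master, data and variable names are made up for illustration):

import org.apache.spark.sql.SparkSession

// a local session just for trying this out; in a real job it usually already exists
val spark = SparkSession.builder().master("local[*]").appName("unpersist-demo").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
rdd.cache()              // shorthand for persist(StorageLevel.MEMORY_ONLY)

val total = rdd.sum()    // the first action materializes the cached blocks
val count = rdd.count()  // served from the cache, no recomputation

rdd.unpersist()          // tell the driver the blocks can be dropped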

Whether that is enough depends on whether there are dependencies between those RDDs. For example:

val rdd2 = rdd1.<transformation>
val rdd3 = rdd2.<transformation>
...

In this case Spark remembers the lineage, so there is always a reference to the old RDD, and the Spark driver will not pick it for cleanup (RDD cleanup is done by GC on the driver, which recycles an RDD's blocks only once nothing refers to it anymore).
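
As a concrete (hypothetical) version of such a chain, reusing sc from the sketch above and a placeholder input path, you can see the retained lineage with toDebugString:

// every downstream RDD keeps a reference to its parent,
// so rdd1 stays reachable for as long as rdd3 is reachable
val rdd1 = sc.textFile("hdfs:///some/input")
val rdd2 = rdd1.map(_.toLowerCase)
val rdd3 = rdd2.filter(_.contains("error"))

println(rdd3.toDebugString)   // prints the whole chain down to rdd1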
So unpersist() alone won't work in this case; the only way is to use localCheckpoint(). This is what I have done before, and it worked for me:

import org.apache.spark.storage.StorageLevel

// persist first, then cut the lineage with a local checkpoint
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
   .localCheckpoint()
// do something with rdd here, and later...
rdd.unpersist()

This makes Spark truncate the lineage correctly, and you can then safely unpersist() the RDD without worrying about uncleaned references.
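
To see that the truncation actually happens, you can apply the same pattern to the hypothetical rdd3 from the chain above and print its toDebugString after an action has materialized the checkpoint:

import org.apache.spark.storage.StorageLevel

rdd3.persist(StorageLevel.MEMORY_AND_DISK_2).localCheckpoint()
rdd3.count()                  // an action materializes the local checkpoint

// the printed lineage now stops at the checkpoint instead of reaching back
// to rdd1, so the old parents (and their shuffle files) become eligible for
// cleanup, and rdd3 itself can be unpersisted once it is no longer needed
println(rdd3.toDebugString)
rdd3.unpersist()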
For details on how to properly truncate the execution-plan lineage, see this guide on RDD checkpointing: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-checkpointing.html
