Question
I have a looping operation which generates some RDDs, does a repartition, then an aggregateByKey operation. After the loop runs once, it computes a final RDD, which is cached and checkpointed, and also used as the initial RDD for the next iteration of the loop.
These RDDs are quite large and generate lots of intermediate shuffle blocks before arriving at the final RDD for every iteration. I am compressing my shuffles and allowing shuffles to spill to disk.
I notice on my worker machines that the working directory where the shuffle files are stored is not being cleaned up, so eventually I run out of disk space. I was under the impression that checkpointing my RDD would remove all the intermediate shuffle blocks, but this does not seem to be happening. Would anyone have any ideas on how I could clean out my shuffle blocks after every iteration of the loop, or why my shuffle blocks aren't being cleaned up?
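For context, a minimal sketch of the kind of loop being described; the transformations, checkpoint directory, and iteration count are illustrative placeholders, not the actual job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Illustrative iterative job: repartition + aggregateByKey each iteration,
// with the result cached and checkpointed to seed the next iteration.
val conf = new SparkConf().setAppName("iterative-shuffle-example")
val sc = new SparkContext(conf)
sc.setCheckpointDir("/tmp/checkpoints")             // hypothetical checkpoint directory

var current = sc.parallelize(0 until 1000000).map(i => (i % 1000, 1L))

for (iteration <- 1 to 10) {                        // illustrative iteration count
  val next = current
    .repartition(200)                               // writes shuffle files on the workers
    .aggregateByKey(0L)(_ + _, _ + _)               // another round of shuffle files

  next.persist(StorageLevel.MEMORY_AND_DISK)
  next.checkpoint()
  next.count()                                      // materializes the cache and the checkpoint

  current = next                                    // final RDD becomes the next iteration's input
}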
Answer 1:
Once you have cached the RDD to memory/disk, it will stay there for as long as the Spark context is alive. To tell the driver that it can remove the RDD from memory/disk, you need to use the unpersist() function.
From the Scaladoc:

/**
 * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
 *
 * @param blocking Whether to block until all blocks are deleted.
 * @return This RDD.
 */
def unpersist(blocking: Boolean = true)
So you can use:
rdd.unpersist()
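Applied to the looping workload in the question, one pattern (a sketch with illustrative names, assuming the previous iteration's RDD is held in a variable called current) is to unpersist the old RDD only after the new one has been materialized:

// Sketch: inside the loop, after the new RDD has been computed.
next.persist(StorageLevel.MEMORY_AND_DISK)
next.checkpoint()
next.count()            // force materialization before dropping the old cache
current.unpersist()     // tell the driver the previous iteration's blocks can be removed
current = next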
Answer 2:
It depends on whether you have dependencies between those RDDs. For example:
val rdd2 = rdd1.<transformation>
val rdd3 = rdd2.<transformation>
...
In this case, Spark will remember the lineage, so there will always be a reference to the old RDD, which prevents it from being chosen for cleanup by the Spark driver (RDD cleanup is driven by GC on the Spark driver, which recycles RDD references once they are no longer referred to).
So persist() won't work in this case; the only way is to use localCheckpoint(). This is what I have done before, and it worked for me:
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
   .localCheckpoint()   // truncates the lineage so old parent RDDs can be garbage-collected
// do something with rdd here, and later:
rdd.unpersist()
This makes Spark truncate the lineage correctly, and then you can safely unpersist() the RDD without worrying about uncleaned references.
For more on how to properly truncate execution plan lineage, see: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-checkpointing.html
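Putting the two answers together for the iterative workload in the question, a rough sketch (same illustrative names as above; initialRdd and numIterations are placeholders, not part of either answer) could look like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

var current: RDD[(Int, Long)] = initialRdd           // placeholder starting RDD

for (iteration <- 1 to numIterations) {              // placeholder iteration count
  val next = current
    .repartition(200)
    .aggregateByKey(0L)(_ + _, _ + _)
    .persist(StorageLevel.MEMORY_AND_DISK_2)
    .localCheckpoint()                                // truncates the lineage back to the cached blocks

  next.count()                                        // materialize before releasing the parent

  current.unpersist()                                 // old RDD is no longer referenced and can be cleaned up
  current = next
}

The count() before unpersist() matters: the new RDD has to be fully materialized before the blocks of its parent are released.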
Source: https://stackoverflow.com/questions/34788507/spark-clean-up-shuffle-spilled-to-disk