How to delete an RDD in PySpark for the purpose of releasing resources?

别跟我提以往 · 2020-12-31 00:36

If I have an RDD that I no longer need, how do I delete it from memory? Would the following be enough to get this done:

del thisRDD

Thanks!

4 Answers
  •  [愿得一人] · 2020-12-31 01:27

    No, del thisRDD is not enough; it would only delete the Python reference to the RDD. You should call thisRDD.unpersist() to remove the cached data.
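
    For example, here is a minimal sketch of the difference (assuming an existing SparkContext sc; the name alias is purely illustrative, and the RDD mirrors the example below):

        thisRDD = sc.parallelize(range(10), 2).cache()  # illustrative RDD, marked for caching
        thisRDD.count()                                 # an action materializes the cached partitions
        alias = thisRDD                                 # keep a second reference to the same RDD
        del thisRDD                                     # removes only the Python name
        alias.getStorageLevel()                         # still reports the memory level: the cached data is untouched
        alias.unpersist()                               # this is what actually releases the cached partitions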

    For your information, Spark uses a lazy evaluation model, which means that when you run this code:

    >>> thisRDD = sc.parallelize(xrange(10),2).cache()
    

    you won't actually have any data cached yet; the RDD is only marked as 'to be cached' in its execution plan. You can check it this way:

    >>> print thisRDD.toDebugString()
    (2) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
     |  ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:364 [Memory Serialized 1x Replicated]
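
    As a small check (a sketch, assuming the same session), note that the Python-side flags only record that caching was requested, not that any partitions have been materialized yet:

        thisRDD.is_cached          # True: .cache() only marks the RDD as 'to be cached'
        thisRDD.getStorageLevel()  # already reports the requested level, before any action has run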
    

    But once you have called an action on this RDD at least once, it becomes cached:

    >>> thisRDD.count()
    10
    >>> print thisRDD.toDebugString()
    (2) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
     |       CachedPartitions: 2; MemorySize: 174.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
     |  ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:364 [Memory Serialized 1x Replicated]
    

    You can easily check the persisted data and the level of persistence in the Spark UI at http://<driver>:4040/storage. There you would see that del thisRDD does not change the persistence of this RDD, while thisRDD.unpersist() removes the cached data; you can still use thisRDD in your code afterwards, but it is no longer kept in memory and is recomputed each time it is queried.
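
    A short sketch of that last point (assuming the same thisRDD from the walkthrough above):

        thisRDD.unpersist()  # drops the cached partitions from executor memory
        thisRDD.count()      # still works; the result is now recomputed from the lineage each time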
