Understanding Spark's caching

天命终不由人 2020-12-02 08:41

I'm trying to understand how Spark's cache works.

Here is my naive understanding; please let me know if I'm missing something:

val rdd1 = sc.textF         


        
3 Answers
  •  忘掉有多难
    2020-12-02 09:03

    It would seem that Option B is required. The reason lies in how Spark executes persist/cache and unpersist. RDD transformations only build up a DAG description and execute nothing, so in Option A, by the time you call unpersist, you still have only job descriptions and no running execution.

    This matters because a cache or persist call only adds the RDD to a Map of RDDs that are marked to be persisted during job execution. unpersist, by contrast, directly tells the blockManager to evict the RDD from storage and removes its reference from the Map of persistent RDDs.

    persist function

    unpersist function
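
    The bookkeeping described above can be observed through SparkContext.getPersistentRDDs. The following is a minimal sketch, assuming a local master; the object name PersistBookkeeping, the helper observe, and the sample data are hypothetical and not part of the original question. It checks whether the RDD is registered right after cache (before any action runs) and again right after unpersist.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PersistBookkeeping {
  // Hypothetical helper: returns (registered after cache, registered after unpersist).
  def observe(sc: SparkContext): (Boolean, Boolean) = {
    val rdd = sc.parallelize(1 to 10) // stand-in data, nothing is computed here

    // cache() only marks the RDD and registers it with the SparkContext.
    rdd.cache()
    val registeredAfterCache = sc.getPersistentRDDs.contains(rdd.id)

    // unpersist() takes effect immediately, even though no job has run yet.
    rdd.unpersist()
    val registeredAfterUnpersist = sc.getPersistentRDDs.contains(rdd.id)

    (registeredAfterCache, registeredAfterUnpersist)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("persist-bookkeeping").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(observe(sc))
    } finally {
      sc.stop()
    }
  }
}
```

    Note that no action is called anywhere in observe: the registration and its removal both happen on the driver without any job being executed.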

    So you need to call unpersist after Spark has actually executed the job and stored the RDD with the block manager.
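
    As an illustration of that ordering, here is a minimal sketch, assuming a local master and a small in-memory dataset in place of the text file from the question. The names UnpersistOrdering, countsWithCache, base, evens and large are hypothetical; the question's own Option A and Option B are not reproduced here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnpersistOrdering {
  // Hypothetical helper: runs two actions over a cached RDD, then evicts it.
  def countsWithCache(sc: SparkContext): (Long, Long) = {
    // Stand-in for the input RDD; transformations are lazy.
    val base = sc.parallelize(1 to 1000).map(_ * 2)

    base.cache() // marks the RDD for caching; nothing is stored yet

    val evens = base.filter(_ % 4 == 0)
    val large = base.filter(_ > 1000)

    // Actions trigger execution. The first one computes and stores `base`;
    // the second one can read `base` from the cache.
    val evenCount = evens.count()
    val largeCount = large.count()

    // Evict only after every action that depends on `base` has run.
    base.unpersist()

    (evenCount, largeCount)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("unpersist-ordering").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (evenCount, largeCount) = countsWithCache(sc)
      println(s"evens=$evenCount large=$largeCount")
    } finally {
      sc.stop()
    }
  }
}
```

    Moving base.unpersist() above the two count() calls would still produce the same counts, but the RDD would no longer be marked for caching when the jobs run, so each action would recompute it from scratch.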

    The comments for the RDD.persist method hint towards this: rdd.persist
