Understanding Spark's caching

天命终不由人 2020-12-02 08:41

I'm trying to understand how Spark's cache works.

Here is my naive understanding, please let me know if I'm missing something:

    val rdd1 = sc.textFile("some data")
    rdd1.cache() // mark rdd1 to be cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.saveAsTextFile("...")
    rdd3.saveAsTextFile("...")

3 Answers
  •  夕颜 2020-12-02 09:13

    Option B is the optimal approach, with one small tweak: use a less expensive action to materialize the cache. In your code, saveAsTextFile is an expensive operation; replace it with count.

    The idea is to free the big rdd1 once it is no longer needed for further computation: after rdd2 and rdd3 have been materialized and cached, later actions read them straight from the cache. (Note that unpersist only drops rdd1's cached blocks; the lineage through rdd1 is kept, so lost partitions of rdd2 or rdd3 can still be recomputed.)

    Updated version of your code:

        val rdd1 = sc.textFile("some data").cache()
        val rdd2 = rdd1.filter(...).cache()
        val rdd3 = rdd1.map(...).cache()

        rdd2.count() // cheap action: computes rdd1 once, caches it, and caches rdd2
        rdd3.count() // reads rdd1 from the cache while materializing rdd3

        rdd1.unpersist() // rdd2 and rdd3 are cached now, so rdd1's blocks can be freed
    
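    For a runnable, self-contained sketch of the same pattern (the dataset, transformations, and local master here are made up for illustration; they are not from the original code):

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.storage.StorageLevel

        object CacheDemo {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(
              new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

            // Stand-in for sc.textFile("some data"): a small in-memory dataset.
            val rdd1 = sc.parallelize(1 to 1000000).cache()
            val rdd2 = rdd1.filter(_ % 2 == 0).cache()
            val rdd3 = rdd1.map(_ * 2L).cache()

            // Cheap actions: rdd1 is computed and cached during the first count,
            // then served from the cache during the second.
            println(rdd2.count()) // 500000
            println(rdd3.count()) // 1000000

            // rdd2 and rdd3 are cached now, so rdd1's blocks can be dropped.
            rdd1.unpersist()

            // unpersist resets the requested storage level back to NONE;
            // rdd2 and rdd3 remain cached in memory.
            println(rdd1.getStorageLevel == StorageLevel.NONE) // true
            println(rdd2.getStorageLevel.useMemory)            // true

            sc.stop()
          }
        }

    While the application is running, the Storage tab of the Spark UI shows the same information: which RDDs are cached and how much memory they occupy.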
