Question
Can anyone explain why RDD blocks increase when I run the Spark code a second time, even though they were stored in Spark memory during the first run? I am providing input using a thread. What is the exact meaning of RDD blocks?
Answer 1:
I have been researching this today, and it seems the "RDD Blocks" figure shown in the UI is actually the sum of RDD blocks and non-RDD blocks. Check out the code at: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala
val rddBlocks = status.numBlocks
And if you go to the link below in the Apache Spark repo on GitHub: https://github.com/apache/spark/blob/d5b1d5fc80153571c308130833d0c0774de62c92/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala
you will find the following lines of code:
/**
* Return the number of blocks stored in this block manager in O(RDDs) time.
*
* @note This is much faster than `this.blocks.size`, which is O(blocks) time.
*/
def numBlocks: Int = _nonRddBlocks.size + numRddBlocks
Non-RDD blocks are the ones created by broadcast variables, as these are stored as cached blocks in memory. The driver ships tasks to the executors through broadcast variables. These system-created broadcast variables are later deleted by the ContextCleaner service, and the corresponding non-RDD blocks are removed along with them. RDD blocks, by contrast, are freed explicitly via rdd.unpersist().
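As a minimal sketch of how these two kinds of blocks come and go (this uses the standard RDD and broadcast APIs, not code from the linked Spark sources; names like RddBlocksDemo and lookup are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RddBlocksDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-blocks-demo")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Caching an RDD materializes its partitions as RDD blocks on the
    // executors; these contribute to the "RDD Blocks" count in the UI.
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
      .persist(StorageLevel.MEMORY_ONLY)
    rdd.count() // action triggers materialization of the cached blocks

    // A broadcast variable is stored as a cached (non-RDD) block on each
    // executor that uses it.
    val lookup = sc.broadcast(Map(0 -> "even", 1 -> "odd"))
    rdd.map(x => lookup.value.getOrElse(x % 2, "?")).count()

    // RDD blocks are freed explicitly; non-RDD blocks from broadcast
    // variables are cleaned up by the ContextCleaner once the variable
    // goes out of scope, or immediately via destroy().
    rdd.unpersist(blocking = true)
    lookup.destroy()

    spark.stop()
  }
}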
Source: https://stackoverflow.com/questions/38067919/can-anyone-explain-about-rdd-blocks-in-executors