What is RDD dependency in Spark?


Question


As far as I know, there are two types of dependencies: narrow and wide. But I don't understand how the dependency affects the child RDD. Is the child RDD only metadata that describes how to build new RDD blocks from the parent RDD? Or is the child RDD a self-sufficient set of data created from the parent RDD?


Answer 1:


Yes, the child RDD is metadata that describes how to calculate the RDD from the parent RDD.

Consider org/apache/spark/rdd/MappedRDD.scala for example:

private[spark]
class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {

  // the child RDD reuses the parent's partitioning unchanged
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // computing a partition means iterating over the parent's partition and applying f
  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).map(f)
}

When you say rdd2 = rdd1.map(...), rdd2 will be such a MappedRDD. compute is only executed later, for example when you call rdd2.collect.
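A minimal sketch of that laziness, assuming a live SparkContext named sc:

val rdd1 = sc.parallelize(1 to 10)
val rdd2 = rdd1.map(_ * 2)     // only builds the metadata (a MappedRDD); nothing runs yet
val doubled = rdd2.collect()   // the action triggers compute on every partition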

An RDD is always such metadata, even if it has no parents (for example sc.textFile(...)). The only case in which an RDD's data is stored on the nodes is when you mark it for caching with rdd.cache and then cause it to be computed.
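For illustration, a rough sketch (the input path is only an example, not from the original post):

val lines = sc.textFile("hdfs:///example/input.txt")   // example path
val words = lines.flatMap(_.split(" "))
words.cache()    // only marks the RDD for caching
words.count()    // first action computes the partitions and keeps them in memory
words.count()    // second action reuses the cached partitions instead of rereading the file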

Another similar situation is calling rdd.checkpoint. This function marks the RDD for checkpointing. The next time it is computed, it will be written to disk, and later access to the RDD will cause it to be read from disk instead of being recalculated.
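A sketch of that flow (the checkpoint directory is an example):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // example directory; must be set before checkpointing
val rdd = sc.parallelize(1 to 100).map(_ + 1)
rdd.checkpoint()   // marks the RDD; nothing is written yet
rdd.count()        // this computation also writes the RDD's data to the checkpoint directory
rdd.count()        // later actions read the checkpointed data from disk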

The difference between cache and checkpoint is that a cached RDD still retains its dependencies. The cached data can be discarded under memory pressure, and may need to be recalculated in part or whole. This cannot happen with a checkpointed RDD, so the dependencies are discarded there.
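One way to observe this is rdd.toDebugString, which prints the lineage. A sketch along the same lines (not from the original answer; the directory is again an example):

val base = sc.parallelize(1 to 100)

// cached RDD: the lineage survives
val cached = base.map(_ * 2).filter(_ > 10)
cached.cache()
cached.count()
println(cached.toDebugString)   // still shows the filter/map lineage back to the parallelized collection

// checkpointed RDD: the lineage is cut
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // example directory
val checkpointed = base.map(_ * 2).filter(_ > 10)
checkpointed.checkpoint()
checkpointed.count()                  // writes the data to the checkpoint directory
println(checkpointed.toDebugString)   // lineage now ends at a checkpoint RDD; the old dependencies are gone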



Source: https://stackoverflow.com/questions/28514509/what-is-rdd-dependency-in-spark
