When does a RDD lineage is created? How to find lineage graph?

两盒软妹~` 提交于 2019-12-01 11:37:41

问题


I am learning Apache Spark and trying to get the lineage graph of the RDDs. But i could not find when does a particular lineage is created? Also, where to find the lineage of an RDD?


回答1:


RDD Lineage is the logical execution plan of a distributed computation that is created and expanded every time you apply a transformation on any RDD.

Note the part "logical" not "physical" that happens after you've executed an action.

Quoting Mastering Apache Spark 2 gitbook:

RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of a RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.

A RDD lineage graph is hence a graph of what transformations need to be executed after an action has been called.

Any RDD has a RDD lineage even if that means that the RDD lineage is just a single node, i.e. the RDD itself. That's because an RDD may or may not be a result of a series of transformations (and no transformations is a "zero-effect" transformation :))

You can check out the RDD lineage of an RDD using RDD.toDebugString:

toDebugString: String A description of this RDD and its recursive dependencies for debugging.

val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []

val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
 +-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
    |  MapPartitionsRDD[1] at map at <console>:25 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []


来源:https://stackoverflow.com/questions/47693355/when-does-a-rdd-lineage-is-created-how-to-find-lineage-graph

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!