In Apache Spark cogroup, how to make sure 1 RDD of >2 operands is not moved?


Question


In a cogroup transformation, e.g. RDD1.cogroup(RDD2, ...), I used to assume that Spark only shuffles/moves RDD2 and retains RDD1's partitioning and in-memory storage if:

  1. RDD1 has an explicit partitioner
  2. RDD1 is cached.

In my other projects, most of the shuffling behaviour seemed consistent with this assumption, so yesterday I wrote a short Scala program to prove it once and for all:

// sc is the SparkContext
import org.apache.spark.HashPartitioner

val rdd1 = sc.parallelize(1 to 10, 4).map(v => v -> v)
  .partitionBy(new HashPartitioner(4))
rdd1.persist().count() // materialise rdd1 in the cache
val rdd2 = sc.parallelize(1 to 10, 4).map(v => (11 - v) -> v)

// for each key, keep the first value from each side of the cogroup
val cogrouped = rdd1.cogroup(rdd2).map {
  v =>
    v._2._1.head -> v._2._2.head
}

// zip the cogrouped partitions position by position with rdd1's and rdd2's values
val zipped = cogrouped.zipPartitions(rdd1, rdd2) {
  (itr1, itr2, itr3) =>
    itr1.zipAll(itr2.map(_._2), 0 -> 0, 0).zipAll(itr3.map(_._2), (0 -> 0) -> 0, 0)
      .map {
        v =>
          (v._1._1._1, v._1._1._2, v._1._2, v._2)
      }
}

zipped.collect().foreach(println)

If rdd1 doesn't move, the first column of zipped should have the same value as the third column. So I ran the program, and oops:

(4,7,4,1)
(8,3,8,2)
(1,10,1,3)
(9,2,5,4)
(5,6,9,5)
(6,5,2,6)
(10,1,6,7)
(2,9,10,0)
(3,8,3,8)
(7,4,7,9)
(0,0,0,10)

The assumption is not true. Spark probably did some internal optimisation and decided that regenerating rdd1's partitions is much faster than keeping them in cache.

So the question is: if my programmatic requirement to not move RDD1 (and keep it cached) stems from reasons other than speed (e.g. resource locality), or if on some occasions Spark's internal optimisation is not preferable, is there a way to explicitly instruct the framework not to move an operand in all cogroup-like operations? This also includes join, outer join, and groupWith.

Thanks a lot for your help. So far I'm using a broadcast join as a not-so-scalable makeshift solution; it is not going to last long before crashing my cluster. I'm expecting a solution consistent with distributed computing principles.
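For reference, here is a minimal sketch of the broadcast-join workaround mentioned above (an illustration only, not the approach being asked for; it assumes rdd2 is small enough to be collected to the driver, and the names rdd2Map and joined are made up for this example):

// Broadcast rdd2 to every executor and join map-side, so rdd1 is never shuffled.
// Sketch only: assumes rdd2 fits comfortably in driver/executor memory.
val rdd2Map = sc.broadcast(rdd2.collect().toMap)

val joined = rdd1.mapPartitions(
  iter => iter.flatMap { case (k, v1) =>
    rdd2Map.value.get(k).map(v2 => k -> (v1, v2))
  },
  preservesPartitioning = true // rdd1's partitioner and cached partitions stay untouched
)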


Answer 1:


If rdd1 doesn't move, the first column of zipped should have the same value as the third column

This assumption is just incorrect. Creating a CoGroupedRDD is not only about shuffling, but also about generating the internal structures required for matching corresponding records. Internally, Spark uses its own ExternalAppendOnlyMap, which is backed by a custom open hash table implementation (AppendOnlyMap) that doesn't provide any ordering guarantees.
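One way to see this (a hypothetical check, not part of the original answer) is to compare the set of keys per partition rather than their order: the cogroup result keeps rdd1's keys in the same partitions, only the in-partition order changes.

// Collect the key set of each partition for rdd1 and for the cogroup result.
val rdd1Keys = rdd1
  .mapPartitionsWithIndex((i, it) => Iterator(i -> it.map(_._1).toSet))
  .collect().toMap
val cogroupKeys = rdd1.cogroup(rdd2)
  .mapPartitionsWithIndex((i, it) => Iterator(i -> it.map(_._1).toSet))
  .collect().toMap

// Same keys per partition, just in a different order inside each partition.
println(rdd1Keys == cogroupKeys) // expected: true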

If you check the debug string:

zipped.toDebugString
(4) ZippedPartitionsRDD3[8] at zipPartitions at <console>:36 []
 |  MapPartitionsRDD[7] at map at <console>:31 []
 |  MapPartitionsRDD[6] at cogroup at <console>:31 []
 |  CoGroupedRDD[5] at cogroup at <console>:31 []
 |  ShuffledRDD[2] at partitionBy at <console>:27 []
 |      CachedPartitions: 4; MemorySize: 512.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 +-(4) MapPartitionsRDD[1] at map at <console>:26 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:26 []
 +-(4) MapPartitionsRDD[4] at map at <console>:29 []
    |  ParallelCollectionRDD[3] at parallelize at <console>:29 []
 |  ShuffledRDD[2] at partitionBy at <console>:27 []
 |      CachedPartitions: 4; MemorySize: 512.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 +-(4) MapPartitionsRDD[1]...

you'll see that Spark indeed uses the CachedPartitions to compute the zipped RDD. If you also skip the map transformations, which remove the partitioner, you'll see that cogroup reuses the partitioner provided by rdd1:

rdd1.cogroup(rdd2).partitioner == rdd1.partitioner
Boolean = true
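As a side note (a sketch, not from the original answer): if the per-key result is all you need, mapValues keeps that partitioner, whereas the map used for cogrouped above discards it, so downstream keyed operations on the result can avoid another shuffle.

// mapValues preserves the partitioner; map discards it.
val cogroupedKeepPartitioner = rdd1.cogroup(rdd2).mapValues {
  case (vs1, vs2) => vs1.head -> vs2.head
}

cogroupedKeepPartitioner.partitioner == rdd1.partitioner
// Boolean = true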


Source: https://stackoverflow.com/questions/45015512/in-apache-spark-cogroup-how-to-make-sure-1-rdd-of-2-operands-is-not-moved
