How does Spark achieve sort order?

耶瑟儿~ 2020-12-05 14:41

Assume I have a list of Strings. I filter & sort them, and collect the result to the driver. However, things are distributed, and each RDD holds its own part of the original list.

1 Answer
  • 2020-12-05 15:26

    Sorting in Spark is a multi-phase process that requires shuffling:

    1. the input RDD is sampled, and the sample is used to compute the boundaries of each output partition (sample followed by collect)
    2. the input RDD is partitioned with a RangePartitioner using the boundaries computed in the first step (partitionBy)
    3. each partition from the second step is sorted locally (mapPartitions)

    When the data is collected, all that is left is to follow the order defined by the partitioner.
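    The three phases can be sketched in plain Python, with no Spark involved. The function name, partition layout, and boundary-selection details below are illustrative assumptions, not Spark internals; the point is only the structure: sample to pick range boundaries, bucket every element by range, sort each bucket locally, then concatenate buckets in order.

    ```python
    import random

    # Plain-Python sketch of Spark's multi-phase sort. Each "partition"
    # is just a list; sort_distributed and its parameters are hypothetical.
    def sort_distributed(partitions, num_out=3, sample_size=20):
        # Phase 1: sample the input and derive num_out - 1 range boundaries
        # (Spark: sample + collect inside RangePartitioner).
        all_items = [x for p in partitions for x in p]
        sample = sorted(random.sample(all_items, min(sample_size, len(all_items))))
        step = max(1, len(sample) // num_out)
        boundaries = sample[step::step][:num_out - 1]

        # Phase 2: range-partition every element into its bucket
        # (Spark: partitionBy with a RangePartitioner, i.e. the shuffle).
        out = [[] for _ in range(num_out)]
        for x in all_items:
            idx = sum(1 for b in boundaries if x > b)
            out[idx].append(x)

        # Phase 3: sort each output partition locally (Spark: mapPartitions).
        for p in out:
            p.sort()

        # "collect": concatenating partitions in partitioner order
        # yields a globally sorted result.
        return [x for p in out for x in p]

    data = [[4, 9, 1], [7, 2, 8], [5, 3, 6]]
    print(sort_distributed(data))  # -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
    ```

    Because every element in bucket i is no greater than any element in bucket i + 1, local sorting plus ordered concatenation is enough for a global order, which is exactly why collect only needs to follow the partitioner's order.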

    The steps above are clearly reflected in the debug string:

    scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...
    
    scala> rdd.sortBy(identity).toDebugString
    res1: String = 
    (6) MapPartitionsRDD[10] at sortBy at <console>:24 [] // Sort partitions
     |  ShuffledRDD[9] at sortBy at <console>:24 [] // Shuffle
     +-(8) MapPartitionsRDD[6] at sortBy at <console>:24 [] // Pre-shuffle steps
        |  ParallelCollectionRDD[0] at parallelize at <console>:21 [] // Parallelize
    