Assume I have a list of Strings. I filter & sort them, and collect the result to the driver. However, things are distributed, and each RDD has its own part of the original list.
Sorting in Spark is a multiphase process which requires shuffling:
1. the input RDD is sampled, and the sample is used to compute range boundaries for the output partitions (sample followed by collect)
2. the input RDD is partitioned using a RangePartitioner with boundaries computed in the first step (partitionBy)
3. each partition from the second step is sorted locally (mapPartitions)

When the data is collected, all that is left is to follow the order defined by the partitioner.
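The three phases can be sketched in plain Scala, with no Spark required. This is only an illustration of the idea, not Spark's actual implementation; numPartitions and the boundary computation are simplified assumptions:

```scala
val data = Seq(4, 2, 5, 3, 1)
val numPartitions = 2

// Phase 1: "sample" + "collect" -- compute range boundaries on the driver.
// Here we cheat and use all the data; Spark uses a sample.
val boundaries = data.sorted
  .grouped(math.ceil(data.size.toDouble / numPartitions).toInt)
  .toSeq.init.map(_.last)            // upper bound of every partition but the last

// Phase 2: "partitionBy" -- assign each element to a range partition.
val partitions = data.groupBy(x => boundaries.count(_ < x))

// Phase 3: "mapPartitions" -- sort each partition locally.
val sortedParts = partitions.toSeq.sortBy(_._1).map(_._2.sorted)

// Collect: concatenating partitions in partitioner order yields the global sort.
val result = sortedParts.flatten     // Seq(1, 2, 3, 4, 5)
```

Because the partitions cover disjoint, ordered ranges, no merge step is needed at the end: concatenation in partition order is already globally sorted.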
These steps are clearly reflected in the debug string:
scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...
scala> rdd.sortBy(identity).toDebugString
res1: String =
(6) MapPartitionsRDD[10] at sortBy at <console>:24 []    // Sort partitions
 |  ShuffledRDD[9] at sortBy at <console>:24 []          // Shuffle
 +-(8) MapPartitionsRDD[6] at sortBy at <console>:24 []  // Pre-shuffle steps
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 []  // Parallelize
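Back to the original question: filter, sortBy, and collect compose naturally, and collect fetches partitions in partitioner order, so the driver sees a globally sorted result even though each partition was sorted independently. A minimal sketch, assuming an active SparkContext named sc:

```scala
val words = sc.parallelize(Seq("banana", "fig", "apple", "cherry", "date"))

val result = words
  .filter(_.length > 3)   // runs independently on each partition
  .sortBy(identity)       // sample -> range-partition -> sort each partition
  .collect()              // partitions arrive in partitioner order

// result: Array(apple, banana, cherry, date)
```

The filter needs no shuffle at all; only sortBy triggers the sampling and repartitioning described above.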