Assume I have a list of Strings. I filter & sort them, and collect the result to the driver. However, things are distributed, and each RDD has its own part of the original list.
Sorting in Spark is a multiphase process which requires shuffling:
1. the input RDD is sampled, and the sample is used to compute range boundaries for the output partitions (sample followed by collect)
2. the input RDD is partitioned using a RangePartitioner with boundaries computed in the first step (partitionBy)
3. each partition from the second step is sorted locally (mapPartitions)

When the data is collected, all that is left is to follow the order defined by the partitioner.
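The three phases can be sketched in plain Scala, with no Spark required. This is only an illustration of the idea, not Spark's actual implementation; numPartitions and the boundary computation are simplified assumptions:

```scala
val data = Seq(4, 2, 5, 3, 1)
val numPartitions = 2

// Phase 1: "sample" + "collect" -- compute range boundaries on the driver.
// Here we cheat and use all the data; Spark uses a sample.
val boundaries = data.sorted
  .grouped(math.ceil(data.size.toDouble / numPartitions).toInt)
  .toSeq.init.map(_.last)            // upper bound of every partition but the last

// Phase 2: "partitionBy" -- assign each element to a range partition.
val partitions = data.groupBy(x => boundaries.count(_ < x))

// Phase 3: "mapPartitions" -- sort each partition locally.
val sortedParts = partitions.toSeq.sortBy(_._1).map(_._2.sorted)

// Collect: concatenating partitions in partitioner order yields the global sort.
val result = sortedParts.flatten     // Seq(1, 2, 3, 4, 5)
```

Because the partitions cover disjoint, ordered ranges, no merge step is needed at the end: concatenation in partition order is already globally sorted.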
These steps are clearly reflected in the debug string:
scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...
scala> rdd.sortBy(identity).toDebugString
res1: String =
(6) MapPartitionsRDD[10] at sortBy at <console>:24 []    // Sort partitions
 |  ShuffledRDD[9] at sortBy at <console>:24 []          // Shuffle
 +-(8) MapPartitionsRDD[6] at sortBy at <console>:24 []  // Pre-shuffle steps
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 []  // Parallelize
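Back to the original question: filter, sortBy, and collect compose naturally, and collect fetches partitions in partitioner order, so the driver sees a globally sorted result even though each partition was sorted independently. A minimal sketch, assuming an active SparkContext named sc:

```scala
val words = sc.parallelize(Seq("banana", "fig", "apple", "cherry", "date"))

val result = words
  .filter(_.length > 3)   // runs independently on each partition
  .sortBy(identity)       // sample -> range-partition -> sort each partition
  .collect()              // partitions arrive in partitioner order

// result: Array(apple, banana, cherry, date)
```

The filter needs no shuffle at all; only sortBy triggers the sampling and repartitioning described above.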