How does Round Robin partitioning in Spark work?

Submitted by 霸气de小男生 on 2019-12-04 19:45:15

(Checked for Spark version 2.1-2.4)

As far as I can see from the ShuffleExchangeExec code, Spark assigns target partitions to the rows directly within the original partitions (via mapPartitions), without bringing anything to the driver.

The logic is to start with a randomly picked target partition and then assign target partitions to the rows in round-robin order. Note that the "start" partition is picked independently for each source partition, so two source partitions may pick the same start (a collision).
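The per-partition logic described above can be sketched as follows. This is an illustrative simulation, not Spark's actual Scala implementation; the function name and signature are made up for this example.

```python
import random

def round_robin_assign(rows, num_targets, seed=None):
    """Sketch of the assignment described above: within one source
    partition, pick a random starting target, then cycle through the
    targets one row at a time."""
    rng = random.Random(seed)
    position = rng.randrange(num_targets)  # random start per source partition
    out = []
    for row in rows:
        position += 1
        out.append((row, position % num_targets))  # (row, target partition)
    return out
```

Because assignment is strictly cyclic after the random start, any single source partition spreads its rows as evenly as possible: target counts differ by at most one.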

The final distribution depends on several factors: the number of source partitions, the number of target partitions, and the number of rows in your dataframe.
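To see how these factors interact, here is a hypothetical simulation (names are made up for illustration): each source partition round-robins its rows into the targets from its own random start, and the per-target totals are summed. When each source partition holds at least as many rows as there are targets, the cyclic assignment covers every target; with few rows per source partition, colliding random starts can leave some targets empty, which matches the skew described below.

```python
import random
from collections import Counter

def simulate(num_source, rows_per_source, num_targets, seed=0):
    """Simulate round-robin repartitioning: every source partition
    starts at its own random target and cycles from there."""
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(num_source):
        position = rng.randrange(num_targets)  # independent start per source
        for _ in range(rows_per_source):
            position += 1
            totals[position % num_targets] += 1  # one row lands here
    return totals
```

For example, 2 source partitions of 10 rows each into 5 targets always yield exactly 4 rows per target (each source cycles through all 5 targets twice, regardless of its start), whereas 4 source partitions of 2 rows each into 10 targets cannot touch more than 8 targets.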

I can't explain why, but somehow it is linked to the local master.

If you explicitly set:

  • --master local => 1 row per partition (no parallelism)

  • --master "local[2]" => 2 rows per partition (4 partitions empty)

  • --master "local[4]" => 4 rows per partition (6 partitions empty)

  • --master "local[8]" => 8 rows per partition (7 partitions empty)
