How does Spark's RDD.randomSplit actually split the RDD?

Submitted by 一笑奈何 on 2019-11-28 09:05:11

For each range defined by the weights array there is a separate mapPartitionsWithIndex transformation, which preserves partitioning.

Each partition is sampled using a set of BernoulliCellSamplers. For each split, the sampler iterates over the elements of a given partition and selects an item if the value of the next random Double falls in the range defined by the normalized weights. All samplers for a given partition use the same RNG seed. This means it:

  • doesn't shuffle the RDD
  • doesn't take contiguous blocks other than by chance
  • takes a random sample from each partition
  • takes non-overlapping samples
  • requires one pass over the data per split (n passes for n splits)
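
The behaviour listed above can be illustrated with a small, Spark-free simulation in plain Python. This is only a sketch under stated assumptions, not Spark's implementation: `random_split` is a hypothetical helper, partitions are modelled as plain lists, Python's `random.Random` stands in for Spark's RNG, and seeding each partition with `seed + index` is an assumption made for illustration.

```python
import random
from itertools import accumulate


def random_split(partitions, weights, seed):
    """Simulate per-partition cell sampling (illustrative, not Spark's code).

    partitions: list of lists, one inner list per partition.
    weights:    positive numbers, normalized here to ranges within [0, 1].
    Returns one list of partitions per split.
    """
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    # Normalized bounds, e.g. weights [3, 1] -> [0.0, 0.75, 1.0]
    bounds = [0.0] + [c / total for c in cumulative]

    splits = []
    # One full pass over every partition for each split.
    for lo, hi in zip(bounds, bounds[1:]):
        split = []
        for index, partition in enumerate(partitions):
            # Assumption: the seed depends only on the partition, so every
            # split replays the same random sequence for that partition.
            rng = random.Random(seed + index)
            # Keep an element only if its random draw falls in [lo, hi).
            split.append([x for x in partition if lo <= rng.random() < hi])
        splits.append(split)
    return splits


partitions = [list(range(0, 50)), list(range(50, 100))]
train, test = random_split(partitions, [0.7, 0.3], seed=42)
```

Because every split replays the same random sequence for a given partition, each element lands in exactly one split, elements never move between partitions, and the split sizes only approximate the requested weights.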