How does Spark's RDD.randomSplit actually split the RDD?

Submitted by 懵懂的女人 on 2019-11-27 02:38:42

Question


So assume I've got an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2. The RDD is partitioned across 100 partitions.

When calling RDD.randomSplit(0.8,0.2)

Does the function also shuffle the RDD? Or does the splitting simply sample a contiguous 20% of the RDD? Or does it select 20% of the partitions at random?

Ideally, does the resulting split have the same class distribution as the original RDD (i.e., 2:1)?

Thanks


Answer 1:


For each range defined by the weights array there is a separate mapPartitionsWithIndex transformation, which preserves partitioning.
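As a rough illustration only (a simplified sketch, not Spark's actual implementation, which uses the internal BernoulliCellSampler and derives each partition's seed from the overall split seed), each split can be thought of as its own mapPartitionsWithIndex over the parent RDD that keeps the elements whose random draw falls in that split's sub-range of [0, 1). The helper name splitByRange and the per-partition seeding scheme below are hypothetical:

```scala
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Hypothetical helper illustrating the idea: one split = one pass over the parent RDD.
// Because every split re-creates the per-partition RNG from the same seed, the random
// draws are identical across splits, so the resulting samples are disjoint.
def splitByRange[T: ClassTag](rdd: RDD[T], lb: Double, ub: Double, seed: Long): RDD[T] =
  rdd.mapPartitionsWithIndex({ (partitionIndex, iter) =>
    val rng = new Random(seed + partitionIndex)  // simplified per-partition seeding
    iter.filter { _ =>
      val x = rng.nextDouble()
      lb <= x && x < ub                          // keep the element if the draw lands in [lb, ub)
    }
  }, preservesPartitioning = true)

// For weights Array(0.8, 0.2) the normalized cumulative ranges are [0.0, 0.8) and [0.8, 1.0):
// val first  = splitByRange(rdd, 0.0, 0.8, seed)
// val second = splitByRange(rdd, 0.8, 1.0, seed)
```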

Each partition is sampled using a set of BernoulliCellSamplers. For each split, it iterates over the elements of a given partition and selects an item if the value of the next random Double falls in the range defined by the normalized weights. All samplers for a given partition use the same RNG seed. This means that randomSplit:

  • doesn't shuffle the RDD
  • doesn't take contiguous blocks other than by chance
  • takes a random sample from each partition
  • takes non-overlapping samples
  • requires n-splits passes over the data
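Because each split samples every partition independently, the roughly 2:1 class ratio of the original RDD carries over to both splits, up to sampling noise. Here is a small check for the scenario in the question (a sketch assuming a local Spark session; the object name, seed, and exact counts are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object RandomSplitDistributionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("randomSplit-check").getOrCreate()
    val sc = spark.sparkContext

    // 2000 rows of class 1 followed by 1000 rows of class 2, spread over 100 partitions,
    // matching the setup described in the question.
    val rdd = sc.parallelize(Seq.fill(2000)(1) ++ Seq.fill(1000)(2), numSlices = 100)

    val Array(bigger, smaller) = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Both splits keep roughly the original 2:1 class ratio (subject to sampling noise),
    // e.g. bigger around Map(1 -> 1600, 2 -> 800) and smaller around Map(1 -> 400, 2 -> 200).
    println(bigger.countByValue())
    println(smaller.countByValue())

    spark.stop()
  }
}
```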


Source: https://stackoverflow.com/questions/32933143/how-does-sparks-rdd-randomsplit-actually-split-the-rdd
