How does Spark's RDD.randomSplit actually split the RDD

难免孤独 2020-12-09 17:00

So assume I've got an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2. The RDD is partitioned across 100 partitions.

When I call randomSplit on this RDD, does it shuffle the data, does it take contiguous blocks of rows, or does it take a random sample from each partition?
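For concreteness, a minimal sketch of this setup (the class labels and the 0.8/0.2 weights below are only illustrative assumptions, not taken from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("randomSplitQuestion"))

// 3000 rows: the first 2000 labeled class 1, the last 1000 labeled class 2, in 100 partitions.
val rows = (1 to 2000).map(i => (i, 1)) ++ (2001 to 3000).map(i => (i, 2))
val rdd = sc.parallelize(rows, numSlices = 100)

// How are these rows distributed between the resulting splits?
val Array(train, test) = rdd.randomSplit(Array(0.8, 0.2))
```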

1 Answer
  Answered 2020-12-09 17:45

    For each range defined by the weights array there is a separate mapPartitionsWithIndex transformation, which preserves partitioning.
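    A highly simplified sketch of that idea (not the actual Spark source; the helper name and the plain Random standing in for BernoulliCellSampler are made up for illustration): normalize the weights into cumulative [lower, upper) boundaries and build one mapPartitionsWithIndex pass per range.

```scala
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

def randomSplitSketch[T: ClassTag](rdd: RDD[T], weights: Array[Double], seed: Long): Array[RDD[T]] = {
  // Normalize the weights into cumulative boundaries, e.g. (0.8, 0.2) -> 0.0, 0.8, 1.0.
  val total = weights.sum
  val boundaries = weights.map(_ / total).scanLeft(0.0)(_ + _)

  // One mapPartitionsWithIndex transformation per [lower, upper) range.
  boundaries.sliding(2).map { case Array(lower, upper) =>
    rdd.mapPartitionsWithIndex({ (index, iter) =>
      // The same per-partition seed is used for every range, so each element draws
      // the same random value in every pass and therefore lands in exactly one split.
      val rng = new Random(seed + index)
      iter.filter { _ =>
        val x = rng.nextDouble()
        x >= lower && x < upper
      }
    }, preservesPartitioning = true)
  }.toArray
}
```

    Because every range runs its own transformation over the full parent RDD, producing n splits takes n passes over the data, which is the last point in the list below.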

    Each partition is sampled using a set of BernoulliCellSamplers. For each split, it iterates over the elements of a given partition and selects an item if the next random Double falls in the range defined by the normalized weights. All samplers for a given partition use the same RNG seed. It means randomSplit:

    • doesn't shuffle the RDD
    • doesn't take contiguous blocks other than by chance
    • takes a random sample from each partition
    • takes non-overlapping samples
    • requires n-splits passes over the data
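    A rough way to check these properties on the scenario from the question (the 0.8/0.2 weights, seed, and app name here are only illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("randomSplitCheck"))

// 3000 rows in 100 partitions, as in the question.
val rdd = sc.parallelize(1 to 3000, numSlices = 100)
val Array(a, b) = rdd.randomSplit(Array(0.8, 0.2), seed = 1L)

// Non-overlapping samples that together cover the parent RDD.
println(a.intersection(b).isEmpty())              // true
println(a.count() + b.count() == rdd.count())     // true

// A random sample from each partition rather than contiguous blocks:
// nearly every one of the 100 partitions contributes elements to the 80% split.
val nonEmptyPartitions = a
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()
  .count(_._2 > 0)
println(nonEmptyPartitions)                       // close to 100

sc.stop()
```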