How does Spark's RDD.randomSplit actually split the RDD

难免孤独 2020-12-09 17:00

So assume I've got an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2. The RDD is partitioned across 100 partitions.

When I call randomSplit on this RDD, does it shuffle the data, does it take contiguous blocks of rows, or does it take a random sample from each partition?
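For concreteness, a minimal sketch of this setup (the class labels and the 0.8/0.2 weights below are only illustrative assumptions, not taken from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("randomSplitQuestion"))

// 3000 rows: the first 2000 labeled class 1, the last 1000 labeled class 2, in 100 partitions.
val rows = (1 to 2000).map(i => (i, 1)) ++ (2001 to 3000).map(i => (i, 2))
val rdd = sc.parallelize(rows, numSlices = 100)

// How are these rows distributed between the resulting splits?
val Array(train, test) = rdd.randomSplit(Array(0.8, 0.2))
```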

1 Answer
  Answered 2020-12-09 17:45

    For each range defined by the weights array there is a separate mapPartitionsWithIndex transformation, which preserves partitioning.
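    A highly simplified sketch of that idea (not the actual Spark source; the helper name and the plain Random standing in for BernoulliCellSampler are made up for illustration): normalize the weights into cumulative [lower, upper) boundaries and build one mapPartitionsWithIndex pass per range.

```scala
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

def randomSplitSketch[T: ClassTag](rdd: RDD[T], weights: Array[Double], seed: Long): Array[RDD[T]] = {
  // Normalize the weights into cumulative boundaries, e.g. (0.8, 0.2) -> 0.0, 0.8, 1.0.
  val total = weights.sum
  val boundaries = weights.map(_ / total).scanLeft(0.0)(_ + _)

  // One mapPartitionsWithIndex transformation per [lower, upper) range.
  boundaries.sliding(2).map { case Array(lower, upper) =>
    rdd.mapPartitionsWithIndex({ (index, iter) =>
      // The same per-partition seed is used for every range, so each element draws
      // the same random value in every pass and therefore lands in exactly one split.
      val rng = new Random(seed + index)
      iter.filter { _ =>
        val x = rng.nextDouble()
        x >= lower && x < upper
      }
    }, preservesPartitioning = true)
  }.toArray
}
```

    Because every range runs its own transformation over the full parent RDD, producing n splits takes n passes over the data, which is the last point in the list below.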

    Each partition is sampled using a set of BernoulliCellSamplers. For each split, it iterates over the elements of a given partition and selects an item if the next random Double falls in the range defined by the normalized weights. All samplers for a given partition use the same RNG seed. It means randomSplit:

    • doesn't shuffle the RDD
    • doesn't take contiguous blocks other than by chance
    • takes a random sample from each partition
    • takes non-overlapping samples
    • requires n-splits passes over the data
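    A rough way to check these properties on the scenario from the question (the 0.8/0.2 weights, seed, and app name here are only illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("randomSplitCheck"))

// 3000 rows in 100 partitions, as in the question.
val rdd = sc.parallelize(1 to 3000, numSlices = 100)
val Array(a, b) = rdd.randomSplit(Array(0.8, 0.2), seed = 1L)

// Non-overlapping samples that together cover the parent RDD.
println(a.intersection(b).isEmpty())              // true
println(a.count() + b.count() == rdd.count())     // true

// A random sample from each partition rather than contiguous blocks:
// nearly every one of the 100 partitions contributes elements to the 80% split.
val nonEmptyPartitions = a
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()
  .count(_._2 > 0)
println(nonEmptyPartitions)                       // close to 100

sc.stop()
```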