So assume I've got an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2. The RDD is partitioned across 100 partitions.
When
For each range defined by the weights array there is a separate mapPartitionsWithIndex transformation, which preserves partitioning.
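
To make the per-range idea concrete, here is a minimal pure-Python sketch of the mechanism, not Spark's actual code. The helper names (`normalized_bounds`, `sample_partition`, `random_split`) are made up for illustration, and deriving each partition's seed as `seed + idx` is an assumption. Each split re-walks every partition with an RNG seeded the same way and keeps an element only when its random draw lands in that split's range:

```python
import random
from itertools import accumulate


def normalized_bounds(weights):
    """Turn weights into consecutive [lower, upper) ranges covering [0, 1)."""
    total = sum(weights)
    cum = [0.0] + list(accumulate(w / total for w in weights))
    cum[-1] = 1.0  # guard against floating-point drift in the last bound
    return list(zip(cum[:-1], cum[1:]))


def sample_partition(items, lower, upper, seed):
    """Keep the items whose random draw falls in [lower, upper)."""
    rng = random.Random(seed)  # same seed for every split of this partition
    return [x for x in items if lower <= rng.random() < upper]


def random_split(partitions, weights, seed):
    """Return one list of sampled partitions per weight."""
    splits = []
    for lower, upper in normalized_bounds(weights):
        # One full pass over the data per split; partition layout is kept.
        splits.append([
            sample_partition(part, lower, upper, seed + idx)  # assumed seeding
            for idx, part in enumerate(partitions)
        ])
    return splits


# 3000 rows spread over 100 partitions of 30 rows each, as in the question.
rows = list(range(3000))
partitions = [rows[i:i + 30] for i in range(0, 3000, 30)]
train, test = random_split(partitions, [0.7, 0.3], seed=42)
```

Because every split replays the same sequence of random draws for a given partition, each row ends up in exactly one split, and the rows keep their original order inside each partition.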
Each partition is sampled using a set of BernoulliCellSamplers. For each split, it iterates over the elements of a given partition and selects an item if the value of the next random Double falls in the range defined by the normalized weights. All samplers for a given partition use the same RNG seed. It means it: