Question
I was reading the Apache Spark source code and got stuck on the logic of RangePartitioner's sketch method. Can someone please explain what exactly this code is doing?
// spark/core/src/main/scala/org/apache/spark/Partitioner.scala
def sketch[K: ClassTag](rdd: RDD[K],
    sampleSizePerPartition: Int): (Long, Array[(Int, Int, Array[K])]) = {
  val shift = rdd.id
  // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    val seed = byteswap32(idx ^ (shift << 16))
    val (sample, n) = SamplingUtils.reservoirSampleAndCount(
      iter, sampleSizePerPartition, seed)
    Iterator((idx, n, sample))
  }.collect()
  val numItems = sketched.map(_._2.toLong).sum
  (numItems, sketched)
}
Answer 1:
sketch is used in RangePartitioner to sample values in RDD partitions. That is, it uniformly and randomly picks and collects a small subset of element values from every RDD partition.
Note that sketch is only one part of RangePartitioner: its samples are used to figure out range bounds that produce approximately equally sized partitions. Other cool things happen elsewhere in the RangePartitioner code, e.g. where it calculates the required size of the sample subset (sampleSizePerPartition).
See my comments in the code below for a step-by-step explanation.
def sketch[K: ClassTag](rdd: RDD[K],
    sampleSizePerPartition: Int): (Long, Array[(Int, Int, Array[K])]) = {
  val shift = rdd.id
  // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
  // run the sampling function on every partition
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    // the partition number `idx` and rdd.id are combined into a distinct seed for every
    // partition, so that each partition selects its elements with its own random sequence
    val seed = byteswap32(idx ^ (shift << 16))
    // randomly select a sample of up to sampleSizePerPartition elements and count the
    // total number of elements in the partition (returned as n)
    // what is cool about reservoir sampling is that it does this in a single pass:
    // O(N), where N is the number of elements in the partition
    // see more: http://en.wikipedia.org/wiki/Reservoir_sampling
    val (sample, n) = SamplingUtils.reservoirSampleAndCount(
      iter, sampleSizePerPartition, seed)
    // one (partition index, element count, sample) tuple per partition
    Iterator((idx, n, sample))
  }.collect()
  // sum the per-partition counts to get the total number of elements in the RDD
  val numItems = sketched.map(_._2.toLong).sum
  // returns the total count of elements in the RDD and the per-partition samples
  (numItems, sketched)
}
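To make the sampling step more concrete, here is a minimal sketch of what a function like SamplingUtils.reservoirSampleAndCount might do for a single partition. It is not Spark's actual implementation: the object name ReservoirSketch and the helper partitionSeed are hypothetical, and scala.util.Random is assumed as the random number generator. Only byteswap32 (from scala.util.hashing) and the seed expression are taken from the code above.

```scala
import scala.reflect.ClassTag
import scala.util.Random
import scala.util.hashing.byteswap32

// Hypothetical stand-in for illustration only, not Spark's SamplingUtils.
object ReservoirSketch {

  // Single-pass reservoir sampling (Algorithm R) that also counts the input.
  // Returns (sample of at most k elements, total number of elements seen).
  def reservoirSampleAndCount[T: ClassTag](
      input: Iterator[T], k: Int, seed: Long): (Array[T], Long) = {
    val reservoir = new Array[T](k)
    val rand = new Random(seed) // assumption: any seeded RNG works for the sketch
    var count = 0L
    while (input.hasNext) {
      val item = input.next()
      count += 1
      if (count <= k) {
        // the first k elements fill the reservoir directly
        reservoir((count - 1).toInt) = item
      } else {
        // element number `count` replaces a random slot with probability k / count
        val j = (rand.nextDouble() * count).toLong
        if (j < k) reservoir(j.toInt) = item
      }
    }
    // fewer than k elements seen: return only the filled prefix
    val sample = if (count < k) reservoir.take(count.toInt) else reservoir
    (sample, count)
  }

  // Same seed expression as in sketch: mixes the RDD id and the partition index.
  def partitionSeed(rddId: Int, idx: Int): Int =
    byteswap32(idx ^ (rddId << 16))
}
```

For example, ReservoirSketch.reservoirSampleAndCount((1 to 1000).iterator, 20, ReservoirSketch.partitionSeed(7, 0)) returns 20 sampled values together with the count 1000, after reading the iterator only once. That single pass is why sketch can get both the sample and the partition size from one mapPartitionsWithIndex call.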
Source: https://stackoverflow.com/questions/25481622/what-is-sketch-method-doing-in-rangepartitioner-of-spark