What is the sketch method doing in RangePartitioner of Spark?

Posted by 南笙酒味 on 2020-01-04 06:26:24

Question


I was reading the source code of Apache Spark and got stuck on the logic of RangePartitioner's sketch method. Can someone please explain what exactly this code is doing?

// spark/core/src/main/scala/org/apache/spark/Partitioner.scala

def sketch[K:ClassTag](rdd: RDD[K],
  sampleSizePerPartition: Int): (Long, Array[(Int, Int, Array[K])]) = {

  val shift = rdd.id
  // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    val seed = byteswap32(idx ^ (shift << 16))
    val (sample, n) = SamplingUtils.reservoirSampleAndCount(
      iter, sampleSizePerPartition, seed)
    Iterator((idx, n, sample))
  }.collect()
  val numItems = sketched.map(_._2.toLong).sum
  (numItems, sketched)
}

Answer 1:


sketch is used in RangePartitioner to sample values from the RDD's partitions. That is, from every partition it picks and collects a small, uniformly random subset of the elements.

Note that sketch is only one part of RangePartitioner: the samples are used to work out range bounds so that the resulting partitions are approximately equal in size. Other interesting things happen elsewhere in the RangePartitioner code, for example where it calculates the required sample size (sampleSizePerPartition).

See the comments I added to the sketch method below for a step-by-step explanation.
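To make the idea of "range bounds from a sample" concrete, here is a minimal sketch. It is not Spark's actual code: the object RangeBoundsSketch and its helpers boundsFromSample and partitionFor are hypothetical names, and it assumes every sampled key counts equally. RangePartitioner itself additionally weights the samples, since partitions can differ in size.

```scala
import scala.reflect.ClassTag

object RangeBoundsSketch {
  // Hypothetical helper: choose up to (partitions - 1) cut points that are
  // evenly spaced in the sorted sample. Assumes unweighted samples.
  def boundsFromSample[K: Ordering: ClassTag](sample: Array[K], partitions: Int): Array[K] = {
    require(partitions >= 1, "need at least one partition")
    val sorted = sample.sorted
    if (sorted.isEmpty) Array.empty[K]
    else (1 until partitions).map { i =>
      // Index of the i-th quantile, clamped to the last element.
      val pos = math.min(sorted.length - 1, i * sorted.length / partitions)
      sorted(pos)
    }.distinct.toArray
  }

  // Hypothetical helper: a key goes to the first range whose upper bound
  // is >= the key; keys above every bound go to the last partition.
  def partitionFor[K](key: K, bounds: Array[K])(implicit ord: Ordering[K]): Int = {
    val p = bounds.indexWhere(b => ord.lteq(key, b))
    if (p == -1) bounds.length else p
  }
}
```

For example, with the sample 1 to 100 and 4 partitions, the cut points land at the quarter marks of the sorted sample, so each range receives roughly a quarter of the sampled keys.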

def sketch[K:ClassTag](rdd: RDD[K],
  sampleSizePerPartition: Int): (Long, Array[(Int, Int, Array[K])]) = {

  val shift = rdd.id
  // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
  // Run the sampling function on every partition.
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    // The partition number `idx` and rdd.id are combined into a per-partition seed,
    // so that every partition draws its own random selection of elements.
    val seed = byteswap32(idx ^ (shift << 16))
    // Randomly select a sample of up to sampleSizePerPartition elements and count
    // the total number of elements in the partition (n).
    // The nice thing about reservoir sampling is that it does both in a single pass:
    // O(N), where N is the number of elements in the partition.
    // See http://en.wikipedia.org/wiki/Reservoir_sampling
    val (sample, n) = SamplingUtils.reservoirSampleAndCount(
      iter, sampleSizePerPartition, seed)
    Iterator((idx, n, sample))
  }.collect()
  val numItems = sketched.map(_._2.toLong).sum
  // Return the total number of elements in the RDD together with the per-partition samples.
  (numItems, sketched)
}
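For readers who want to see the single-pass idea in isolation, here is a minimal reservoir-sampling sketch that runs without Spark. It is an illustration under assumptions, not the body of SamplingUtils: the object ReservoirSketch and the helper partitionSeed are hypothetical names, and scala.util.Random stands in for whatever random generator Spark uses internally.

```scala
import scala.reflect.ClassTag
import scala.util.Random
import scala.util.hashing.byteswap32

object ReservoirSketch {
  // Keep up to k uniformly chosen items and count all items, in one pass.
  def reservoirSampleAndCount[T: ClassTag](
      input: Iterator[T], k: Int, seed: Long): (Array[T], Long) = {
    val reservoir = new Array[T](k)
    // Fill the reservoir with the first k items.
    var i = 0
    while (i < k && input.hasNext) {
      reservoir(i) = input.next()
      i += 1
    }
    if (i < k) {
      // Fewer than k items in total: the sample is the whole input.
      (reservoir.take(i), i.toLong)
    } else {
      var n = i.toLong
      val rand = new Random(seed)
      while (input.hasNext) {
        val item = input.next()
        n += 1
        // The n-th item replaces a random slot with probability k / n.
        val j = (rand.nextDouble() * n).toLong
        if (j < k) reservoir(j.toInt) = item
      }
      (reservoir, n)
    }
  }

  // Same seed formula as in the sketch method above (hypothetical wrapper).
  def partitionSeed(rddId: Int, idx: Int): Int = byteswap32(idx ^ (rddId << 16))
}
```

Calling ReservoirSketch.reservoirSampleAndCount((1 to 1000).iterator, 10, 42L) returns 10 sampled values plus the count 1000, which corresponds to the (sample, n) pair that sketch collects for each partition.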


Source: https://stackoverflow.com/questions/25481622/what-is-sketch-method-doing-in-rangepartitioner-of-spark
