In Spark, can I find out the machine in the cluster which stores a given element in RDD and then send message to it?

Submitted by 廉价感情 on 2019-11-28 12:28:34

Question


I am new to Spark.

I want to know whether, for an RDD such as RDD = {"0", "1", "2", ..., "99999"}, I can find out the machine in the cluster that stores a given element (e.g. "100").

And then, during a shuffle, can I aggregate some data and send it to a particular machine? I know that RDD partitioning is transparent to users, but could I use some mechanism such as key/value pairs to achieve that?


Answer 1:


Generally speaking the answer is no, or at least not with the RDD API. If you can express your logic using graphs, then you can try the message-based API in GraphX or Giraph. If not, then using Akka directly instead of Spark could be a better choice.

Still, there are some workarounds, but I wouldn't expect high performance. Let's start with some dummy data:

import org.apache.spark.rdd.RDD

// Map a range of character codes to single-character strings
val toPairs = (s: Range) => s.map(_.toChar.toString)

val rdd: RDD[(Int, String)] = sc.parallelize(Seq(
  (0, toPairs(97 to 100)), // a-d
  (1, toPairs(101 to 107)), // e-k
  (2, toPairs(108 to 115)) // l-s
)).flatMap{ case (i, vs) => vs.map(v => (i, v)) }

and partition it using a custom partitioner:

import org.apache.spark.Partitioner

class IdentityPartitioner(n: Int) extends Partitioner {
  def numPartitions: Int = n
  def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

val partitioner = new IdentityPartitioner(4)
val parts = rdd.partitionBy(partitioner)

Now we have an RDD with 4 partitions, one of them empty:

parts.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size))).collect
// Array[(Int, Int)] = Array((0,4), (1,7), (2,8), (3,0))
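As an aside, when the partitioner is known you can compute driver-side which partition a given key lands in. A minimal sketch in plain Scala, assuming the behavior of Spark's default HashPartitioner (a non-negative modulo of the key's hashCode); note this gives you the partition, not the physical machine, since the partition-to-host mapping is decided by the scheduler at runtime:

```scala
// Hedged sketch: HashPartitioner-style key placement
// (assumption: partition = non-negative hashCode modulo numPartitions).
def hashPartition(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

println(hashPartition(100, 4)) // partition index of key 100 with 4 partitions
println(hashPartition(-3, 4))  // negative hash codes are wrapped into range
```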

  • The simplest thing you can do is to leverage partitioning itself. First a dummy function and a helper:

    // Dummy map function
    def transform(s: String) =
      Map("e" -> "x", "k" -> "y", "l" -> "z").withDefault(identity)(s)
    
    // Map String to partition
    def address(curr: Int, s: String) = {
      val m = Map("x" -> 3, "y" -> 3, "z" -> 3).withDefault(x => curr)
      (m(s), s)
    }
    

    and "send" data:

    val transformed: RDD[(Int, String)] = parts
      // Emit pairs (partition, string)
      .map{case (i, s) => address(i, transform(s))}
      // Repartition
      .partitionBy(partitioner)
    
    transformed
      .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
      .collect
    // Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
    
  • Another approach is to collect "messages":

    val tmp = parts.mapValues(s => transform(s))
    
    val messages: Map[Int,Iterable[String]] = tmp
      .flatMap{case (i, s) => {
         val target = address(i, s)
         if (target != (i, s)) Seq(target) else Seq()
       }}
      .groupByKey
      .collectAsMap
    

    create a broadcast variable:

    val messagesBD = sc.broadcast(messages)
    

    and use it to send messages:

    val transformed = tmp
      // Keep only the elements that stay on their current partition
      .filter{case (i, s) => address(i, s) == (i, s)}
      .mapPartitionsWithIndex((i, iter) => {
        // Append the messages addressed to this partition
        val combined = iter.map(_._2) ++ messagesBD.value.getOrElse(i, Seq())
        combined.map((i, _))
      }, true)
    
    transformed
      .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
      .collect
    
    // Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
    

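Both variants can be sanity-checked without a cluster. A plain-Scala simulation of the routing above (same `transform` and `address` helpers, with the repartitioning replaced by a `groupBy`) should reproduce the final per-partition counts:

```scala
// Plain-Scala simulation of the routing scheme above (no Spark needed):
// apply transform, compute each element's target partition with address,
// and regroup to count the elements per partition.
def transform(s: String): String =
  Map("e" -> "x", "k" -> "y", "l" -> "z").withDefault(identity)(s)

def address(curr: Int, s: String): (Int, String) = {
  val m = Map("x" -> 3, "y" -> 3, "z" -> 3).withDefault(_ => curr)
  (m(s), s)
}

// Same dummy data as the RDD: (partition index, single-character string)
val data = Seq(
  (0, (97 to 100).map(_.toChar.toString)),  // a-d
  (1, (101 to 107).map(_.toChar.toString)), // e-k
  (2, (108 to 115).map(_.toChar.toString))  // l-s
).flatMap { case (i, vs) => vs.map(v => (i, v)) }

val counts = data
  .map { case (i, s) => address(i, transform(s)) }
  .groupBy(_._1)
  .map { case (i, xs) => (i, xs.size) }

println(counts.toSeq.sorted) // should match the Spark output above
```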

Source: https://stackoverflow.com/questions/33547142/in-spark-can-i-find-out-the-machine-in-the-cluster-which-stores-a-given-element
