In Spark, can I find out the machine in the cluster which stores a given element in RDD and then send message to it?

Submitted by 廉价感情 on 2019-11-28 12:28:34

Question


I am new to Spark.

I want to know whether, for an RDD such as RDD = {"0", "1", "2", ..., "99999"}, I can find out the machine in the cluster that stores a given element (e.g. "100").

And then, during a shuffle, can I aggregate some data and send it to a particular machine? I know that RDD partitioning is transparent to users, but could I use some mechanism such as key/value pairs to achieve that?


Answer 1:


Generally speaking the answer is no, or at least not with the RDD API. If you can express your logic using graphs, then you can try the message-based API in GraphX or Giraph. If not, then using Akka directly instead of Spark could be a better choice.

Still, there are some workarounds, but I wouldn't expect high performance. Let's start with some dummy data:

import org.apache.spark.rdd.RDD

// Map a range of character codes to single-character strings
val toPairs = (s: Range) => s.map(_.toChar.toString)

val rdd: RDD[(Int, String)] = sc.parallelize(Seq(
  (0, toPairs(97 to 100)), // a-d
  (1, toPairs(101 to 107)), // e-k
  (2, toPairs(108 to 115)) // l-s
)).flatMap{ case (i, vs) => vs.map(v => (i, v)) }

and partition it using a custom partitioner:

import org.apache.spark.Partitioner

class IdentityPartitioner(n: Int) extends Partitioner {
  def numPartitions: Int = n
  def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

val partitioner = new IdentityPartitioner(4)
val parts = rdd.partitionBy(partitioner)

Now we have an RDD with 4 partitions, one of them empty:

parts.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size))).collect
// Array[(Int, Int)] = Array((0,4), (1,7), (2,8), (3,0))
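As an aside, when the partitioner is known you can compute driver-side which partition a given key lands in. A minimal sketch in plain Scala, assuming the behavior of Spark's default HashPartitioner (a non-negative modulo of the key's hashCode); note this gives you the partition, not the physical machine, since the partition-to-host mapping is decided by the scheduler at runtime:

```scala
// Hedged sketch: HashPartitioner-style key placement
// (assumption: partition = non-negative hashCode modulo numPartitions).
def hashPartition(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

println(hashPartition(100, 4)) // partition index of key 100 with 4 partitions
println(hashPartition(-3, 4))  // negative hash codes are wrapped into range
```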

  • The simplest thing you can do is to leverage partitioning itself. First a dummy function and a helper:

    // Dummy map function
    def transform(s: String) =
      Map("e" -> "x", "k" -> "y", "l" -> "z").withDefault(identity)(s)
    
    // Map String to partition
    def address(curr: Int, s: String) = {
      val m = Map("x" -> 3, "y" -> 3, "z" -> 3).withDefault(x => curr)
      (m(s), s)
    }
    

    and "send" data:

    val transformed: RDD[(Int, String)] = parts
      // Emit pairs (partition, string)
      .map{case (i, s) => address(i, transform(s))}
      // Repartition
      .partitionBy(partitioner)
    
    transformed
      .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
      .collect
    // Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
    
  • Another approach is to collect "messages":

    val tmp = parts.mapValues(s => transform(s))
    
    val messages: Map[Int,Iterable[String]] = tmp
      .flatMap{case (i, s) => {
         val target = address(i, s)
         if (target != (i, s)) Seq(target) else Seq()
       }}
      .groupByKey
      .collectAsMap
    

    create a broadcast variable:

    val messagesBD = sc.broadcast(messages)
    

    and use it to send messages:

    val transformed = tmp
      // Keep only the elements that stay on their current partition
      .filter{case (i, s) => address(i, s) == (i, s)}
      .mapPartitionsWithIndex((i, iter) => {
        // Append the messages addressed to this partition
        val combined = iter.map(_._2) ++ messagesBD.value.getOrElse(i, Seq())
        combined.map((i, _))
      }, true)
    
    transformed
      .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
      .collect
    
    // Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
    

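Both variants can be sanity-checked without a cluster. A plain-Scala simulation of the routing above (same `transform` and `address` helpers, with the repartitioning replaced by a `groupBy`) should reproduce the final per-partition counts:

```scala
// Plain-Scala simulation of the routing scheme above (no Spark needed):
// apply transform, compute each element's target partition with address,
// and regroup to count the elements per partition.
def transform(s: String): String =
  Map("e" -> "x", "k" -> "y", "l" -> "z").withDefault(identity)(s)

def address(curr: Int, s: String): (Int, String) = {
  val m = Map("x" -> 3, "y" -> 3, "z" -> 3).withDefault(_ => curr)
  (m(s), s)
}

// Same dummy data as the RDD: (partition index, single-character string)
val data = Seq(
  (0, (97 to 100).map(_.toChar.toString)),  // a-d
  (1, (101 to 107).map(_.toChar.toString)), // e-k
  (2, (108 to 115).map(_.toChar.toString))  // l-s
).flatMap { case (i, vs) => vs.map(v => (i, v)) }

val counts = data
  .map { case (i, s) => address(i, transform(s)) }
  .groupBy(_._1)
  .map { case (i, xs) => (i, xs.size) }

println(counts.toSeq.sorted) // should match the Spark output above
```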

Source: https://stackoverflow.com/questions/33547142/in-spark-can-i-find-out-the-machine-in-the-cluster-which-stores-a-given-element
