Creating array per Executor in Spark and combine into RDD

Submitted by 纵饮孤独 on 2020-01-11 11:10:54

Question


I am moving from MPI-based systems to Apache Spark. I need to do the following in Spark.

Suppose I have n vertices. I want to create an edge list from these n vertices. An edge is just a tuple of two integers (u, v); no attributes are required.

However, I want to create them in parallel, independently in each executor. Therefore, I want to create P edge arrays independently for P Spark executors. Each array may have a different size and depends on the vertices, so I also need the executor id, from 0 to n-1. Next, I want to have a global RDD of edges.

In MPI, I would create an array in each processor using the processor rank. How do I do that in Spark, especially using the GraphX library?

Therefore, my primary goal is to create an array of edges in each executor and combine them into one single RDD.

I am first trying a modified version of the Erdős–Rényi model. As parameters I only have the number of nodes n and a probability p.

Suppose executor i has to process nodes 101 to 200. For any node, say node 101, it will create an edge from 101 to each of the nodes 102 through n, each with probability p. After each executor creates its allocated edges, I would instantiate the GraphX EdgeRDD and VertexRDD. Therefore, my plan is to create the edge lists independently in each executor and merge them into an RDD.
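
In Spark terms, this plan roughly corresponds to the sketch below, where the partition index stands in for the MPI rank (numPartitions and genEdgesLocally are hypothetical placeholders, only meant to illustrate the shape of the solution):

val localEdges = sc
  .parallelize(Seq.empty[Int], numPartitions)   // numPartitions empty partitions
  .mapPartitionsWithIndex { (rank, _) =>
    // rank plays the role of the MPI processor rank (0 to P-1);
    // each partition builds its local edge list independently and
    // Spark combines the partitions into a single RDD
    genEdgesLocally(rank).iterator
  }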


Answer 1:


Let's start with some imports and variables that will be required for downstream processing:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.util.Random
import org.apache.spark.HashPartitioner

val nPartitions: Int = ??? // number of partitions (one per "executor-local" array)
val n: Long = ???          // number of vertices
val p: Double = ???        // edge probability

Next, we'll need an RDD of seed IDs that can be used to generate edges. A naive way to handle this would be something like this:

sc.parallelize(0L until n)

Since the number of generated edges depends on the node id, this approach would give a highly skewed load. We can do a little better with repartitioning:

sc.parallelize(0L until n)
  .map((_, None))
  .partitionBy(new HashPartitioner(nPartitions))
  .keys

but a much better approach is to start with an empty RDD and generate the ids in place. We'll need a small helper:

def genNodeIds(nPartitions: Int, n: Long)(i: Int) = {
  // Partition i gets every id in [0, n) that is congruent to i modulo nPartitions
  (0L until n).filter(_ % nPartitions == i).toIterator
}

which can be used as follows:

val empty = sc.parallelize(Seq.empty[Int], nPartitions)
val ids = empty.mapPartitionsWithIndex((i, _) => genNodeIds(nPartitions, n)(i))
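
As a quick local illustration of how the ids are distributed (calling the helper directly, outside Spark; the numbers are just example values):

// With nPartitions = 4 and n = 10, partition 1 receives the ids
// congruent to 1 modulo 4:
genNodeIds(4, 10L)(1).toList  // List(1, 5, 9)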

Just a quick sanity check (it is quite expensive so don't use it in production):

require(ids.distinct.count == n) 

and we can generate the actual edges using another helper:

def genEdgesForId(p: Double, n: Long, random: Random)(i: Long) = {
  // For node i, create an edge to each j in (i, n) independently with probability p
  (i + 1 until n).filter(_ => random.nextDouble < p).map(j => Edge(i, j, ()))
}

def genEdgesForPartition(iter: Iterator[Long]) = {
  // This could be overkill, but better safe than sorry.
  // Depending on your requirements it could be worth
  // considering commons-math:
  // https://commons.apache.org/proper/commons-math/userguide/random.html
  val random = new Random(new java.security.SecureRandom())
  iter.flatMap(genEdgesForId(p, n, random))
}

val edges = ids.mapPartitions(genEdgesForPartition)
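
Note that the generation is not deterministic, so every action that recomputes edges may draw a different sample; caching makes the sample stable. As a rough, optional sanity check, the expected number of edges for this model is p * n * (n - 1) / 2:

edges.cache()  // optional: keep the same sample across actions

// The actual count should be in the ballpark of the expectation
// for reasonably large n
val expectedEdges = p * n * (n - 1) / 2
println(s"expected ≈ $expectedEdges, actual = ${edges.count()}")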

Finally we can create a graph:

val graph = Graph.fromEdges(edges, ())
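
If you specifically need the VertexRDD and EdgeRDD mentioned in the question, both are exposed by the resulting graph (Graph.fromEdges fills in missing vertices with the default attribute, here ()):

val vertexRDD = graph.vertices  // VertexRDD with Unit attributes
val edgeRDD = graph.edges       // EdgeRDD with Unit attributes
println(s"vertices: ${vertexRDD.count()}, edges: ${edgeRDD.count()}")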


Source: https://stackoverflow.com/questions/34296588/creating-array-per-executor-in-spark-and-combine-into-rdd
