I am moving from MPI-based systems to Apache Spark. I need to do the following in Spark.
Suppose I have n vertices. I want to create an edge list from these n vertices. An edge is just a tuple of two integers (u, v); no attributes are required.
However, I want to create the edges in parallel, independently in each executor. That is, I want to create P edge arrays independently for P Spark executors. Each array may have a different size and depends on the vertices, so I also need the executor id from 0 to P-1. Finally, I want a single global RDD of edges.
In MPI, I would create an array in each processor using the processor rank. How do I do that in Spark, especially using the GraphX library?
Therefore, my primary goal is to create an array of edges in each executor and combine them into a single RDD.
I am first trying a modified version of the Erdős–Rényi model. As parameters I only have the number of nodes n and a probability p.
Suppose executor i has to process nodes 101 to 200. For any node, say node 101, it will create an edge from 101 to each of 102 through n, each with probability p. After every executor creates its allocated edges, I would instantiate the GraphX EdgeRDD and VertexRDD. So my plan is to create the edge lists independently in each executor and merge them into an RDD.
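For illustration, the rule for a single node could be sketched locally (outside Spark) like this; localEdgesFor is just a hypothetical name used for this example:
import scala.util.Random

// Hypothetical local sketch: node i proposes an edge (i, j) to every
// j in (i, n) independently with probability p.
def localEdgesFor(i: Long, n: Long, p: Double, rng: Random): Seq[(Long, Long)] =
  (i + 1 until n).filter(_ => rng.nextDouble() < p).map(j => (i, j))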
Let's start with some imports and variables which will be required for downstream processing:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.util.Random
import org.apache.spark.HashPartitioner
val nPartitions: Int = ???
val n: Long = ???
val p: Double = ???
Next we'll need an RDD of seed IDs which can be used to generate edges. A naive way to handle this would be simply something like this:
sc.parallelize(0L until n)
Since the number of generated edges depends on the node id, this approach would give a highly skewed load. We can do a little bit better with repartitioning:
sc.parallelize(0L until n)
  .map((_, None))
  .partitionBy(new HashPartitioner(nPartitions))
  .keys
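To see whether the hash-partitioned ids actually end up balanced, one can count the elements per partition. This is just a debugging aid, not part of the approach itself, and partitionedIds is only a name introduced for this check:
val partitionedIds = sc.parallelize(0L until n)
  .map((_, None))
  .partitionBy(new HashPartitioner(nPartitions))
  .keys

// Print how many node ids each partition holds.
partitionedIds
  .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
  .collect()
  .foreach { case (i, count) => println(s"partition $i: $count ids") }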
but a much better approach is to start with an empty RDD and generate the ids in place. We'll need a small helper:
def genNodeIds(nPartitions: Int, n: Long)(i: Int) = {
  // Node id k is handled by partition k % nPartitions
  (0L until n).filter(_ % nPartitions == i).toIterator
}
which can be used as follows:
val empty = sc.parallelize(Seq.empty[Int], nPartitions)
val ids = empty.mapPartitionsWithIndex((i, _) => genNodeIds(nPartitions, n)(i))
Just a quick sanity check (it is quite expensive so don't use it in production):
require(ids.distinct.count == n)
and we can generate actual edges using another helper:
def genEdgesForId(p: Double, n: Long, random: Random)(i: Long) = {
  // For node i, propose an edge to every j in (i, n) with probability p
  (i + 1 until n).filter(_ => random.nextDouble < p).map(j => Edge(i, j, ()))
}
def genEdgesForPartition(iter: Iterator[Long]) = {
  // It could be overkill, but better safe than sorry.
  // Depending on your requirements it could be worth
  // considering commons-math:
  // https://commons.apache.org/proper/commons-math/userguide/random.html
  val random = new Random(new java.security.SecureRandom())
  iter.flatMap(genEdgesForId(p, n, random))
}
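Because genEdgesForId is an ordinary function, it can also be smoke-tested on the driver before wiring it into the job. This is an optional check, not part of the original approach; with p = 1.0 every candidate edge must be emitted:
// Node 0 in a 10-node graph with p = 1.0 must produce exactly 9 edges.
val sample = genEdgesForId(1.0, 10L, new Random(42))(0L)
require(sample.length == 9)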
val edges = ids.mapPartitions(genEdgesForPartition)
Finally, we can create a graph:
val graph = Graph.fromEdges(edges, ())
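Note that Graph.fromEdges only materializes vertices that appear in at least one edge, so isolated vertices are dropped. If all n vertices should be present regardless of degree, the graph can be built from an explicit vertex RDD instead (a sketch reusing the ids RDD defined above):
// Give every node id a Unit attribute and pair it with the generated edges.
val vertices: RDD[(VertexId, Unit)] = ids.map(id => (id, ()))
val graphWithAllVertices = Graph(vertices, edges)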
Source: https://stackoverflow.com/questions/34296588/creating-array-per-executor-in-spark-and-combine-into-rdd