I wrote a class that receives a DataFrame, performs some calculations on it, and can export the results. The DataFrames are generated from a list of keys. I know that I am doing this i
You can use Scala's Futures together with Spark's fair scheduling, e.g.
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.sql.DataFrame

object YourApp extends App {
  val sc = ... // SparkContext, be sure to set spark.scheduler.mode=FAIR

  // AtomicInteger so concurrent runners get distinct pool ids;
  // you can wrap this to limit the number of pools
  private val pool = new AtomicInteger(0)
  def poolId: String = s"pool${pool.incrementAndGet()}"

  def runner(i: Int) = Future {
    // note: the pool name must be a String, not an Int
    sc.setLocalProperty("spark.scheduler.pool", poolId)
    val data: DataFrame = DataContainer.getDataFrame(i) // get the DataFrame for this key
    val x = new MyClass(data)                           // initialize MyClass with the new DataFrame
    x.setSettings(...)
    x.calcSomething()
    x.saveResults()
  }

  val l = List(34, 32, 132, 352) // the list of keys
  val futures = l.map(i => runner(i))

  // now wait for all futures to complete
  futures.foreach(f => Await.ready(f, Duration.Inf))
}
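As an aside, instead of awaiting each future in a loop, you can combine them with `Future.sequence` and wait once. A minimal sketch with plain Scala Futures (the `runner` here is a hypothetical stand-in for the Spark work, since a cluster isn't available in a snippet):

```scala
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

// hypothetical stand-in for the per-key Spark job
def runner(i: Int): Future[Int] = Future { i * 2 }

val ids = List(34, 32, 132, 352)

// Future.sequence turns List[Future[Int]] into Future[List[Int]],
// so a single Await covers every job
val all: Future[List[Int]] = Future.sequence(ids.map(runner))
val results = Await.result(all, Duration.Inf)
```

This also fails fast: if any job throws, `Await.result` re-throws that exception instead of silently swallowing it.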
With the fair scheduler and a separate pool per job, each concurrent job gets a fair share of the Spark cluster's resources.
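Pools referenced via `spark.scheduler.pool` that are not declared anywhere get default settings (FIFO within the pool, weight 1). If you want to tune weights or minimum shares, you can declare the pools in an allocation file pointed to by `spark.scheduler.allocation.file` — a minimal sketch, with a hypothetical pool name:

```xml
<?xml version="1.0"?>
<allocations>
  <pool name="pool1">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```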
Some references on Scala's Futures can be found here. You might need to add callbacks on completion, success, and/or failure.
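Such callbacks might look like the following sketch, again with a plain Scala Future standing in for the Spark job:

```scala
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global
import scala.util.{Success, Failure}

// hypothetical stand-in for one Spark job
val f: Future[Int] = Future { 21 * 2 }

// runs asynchronously once the future finishes, on either outcome
f.onComplete {
  case Success(v)  => println(s"job finished with $v")
  case Failure(ex) => println(s"job failed: ${ex.getMessage}")
}

Await.ready(f, Duration.Inf)
```

`onComplete` runs on the execution context, not the calling thread, so don't rely on it having fired by the time `Await.ready` returns.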