Parallelize / avoid foreach loop in Spark

执念已碎 2020-12-09 06:02

I wrote a class that takes a DataFrame, does some calculations on it, and can export the results. The DataFrames are generated from a list of keys. I know that I am doing this i

3 Answers
  •  夕颜
     2020-12-09 06:12

    You can use Scala's Future together with Spark's fair scheduling, e.g.

    import scala.concurrent._
    import scala.concurrent.duration._
    import ExecutionContext.Implicits.global
    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.sql.DataFrame
    
    object YourApp extends App {
      val sc = ... // SparkContext, be sure to set spark.scheduler.mode=FAIR
    
      // atomic counter so concurrently started jobs get distinct pool ids;
      // you can wrap it (e.g. with modulo) to limit the number of pools
      val pool = new AtomicInteger(0)
      def poolId: Int = pool.incrementAndGet()
    
      def runner(i: Int) = Future {
        // setLocalProperty is thread-local, so set the pool inside the Future;
        // note that it expects a String value
        sc.setLocalProperty("spark.scheduler.pool", poolId.toString)
        val data: DataFrame = DataContainer.getDataFrame(i) // get DataFrame
        val x = new MyClass(data)                           // initialize MyClass with the new DataFrame
        x.setSettings(...)
        x.calcSomething()
        x.saveResults()
      }
    
      val l = List(34, 32, 132, 352)      // Scala List
      val futures = l.map(i => runner(i))
    
      // now wait for all futures to complete
      futures.foreach(f => Await.ready(f, Duration.Inf))
    }
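
    Instead of blocking on each future in turn, one alternative (a sketch, not part of the original answer) is to combine them with the standard library's Future.sequence and wait once for the whole batch:

    // combine the list of futures into a single future and block once for all jobs
    Await.ready(Future.sequence(futures), Duration.Inf)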
    

    With the FAIR scheduler and a separate pool per job, each concurrent job gets a fair share of the Spark cluster's resources.
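
    For completeness, a minimal sketch of enabling the fair scheduler when building the context; the app name and the "fairscheduler.xml" allocation-file path are assumptions (the file, which defines named pools, is optional):

    import org.apache.spark.{SparkConf, SparkContext}
    
    // enable FAIR scheduling; the allocation file is optional and hypothetical here
    val conf = new SparkConf()
      .setAppName("YourApp")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "fairscheduler.xml")
    val sc = new SparkContext(conf)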

    For background on Scala's Futures, see the Scala documentation. You might need to add callbacks on completion, success, and/or failure, as in the sketch below.
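
    A hedged sketch of such callbacks on the futures produced by runner above (the println logging is illustrative only):

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Failure, Success}
    
    // attach a completion callback so failures of individual jobs are at least logged
    futures.foreach { f =>
      f.onComplete {
        case Success(_)  => println("job finished")
        case Failure(ex) => println(s"job failed: ${ex.getMessage}")
      }
    }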
