Parallelize / avoid foreach loop in Spark

执念已碎 2020-12-09 06:02

I wrote a class that gets a DataFrame, does some calculations on it, and can export the results. The DataFrames are generated from a list of keys. I know that I am doing this i

3 Answers
  • 2020-12-09 06:09

    You can use Scala's parallel collections to run the foreach in parallel on the driver side.

    val l = List(34, 32, 132, 352).par
    l.foreach { i =>
      // your code to be run in parallel for each i
    }
    

    However, a word of caution: is your cluster capable of running jobs in parallel? You may submit the jobs to your Spark cluster in parallel, but they may end up queued on the cluster and executed sequentially.
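
    If you are worried about flooding the cluster, you can also cap the driver-side parallelism of the parallel collection. A minimal sketch, assuming Scala 2.12 (where ForkJoinTaskSupport takes a java.util.concurrent.ForkJoinPool); the pool size of 4 is an arbitrary assumption:

    import java.util.concurrent.ForkJoinPool
    import scala.collection.parallel.ForkJoinTaskSupport

    val keys = List(34, 32, 132, 352).par
    // cap how many keys are processed concurrently (4 is an assumed value)
    keys.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
    keys.foreach { i =>
      // submit the per-key Spark job here
    }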

  • 2020-12-09 06:12

    You can use Scala's Futures and Spark's fair scheduling, e.g.

    import java.util.concurrent.atomic.AtomicInteger

    import scala.concurrent._
    import scala.concurrent.duration._
    import ExecutionContext.Implicits.global

    import org.apache.spark.sql.DataFrame

    object YourApp extends App {
      val sc = ... // SparkContext, be sure to set spark.scheduler.mode=FAIR

      // one pool per job; AtomicInteger keeps the counter safe across concurrent Futures
      val pool = new AtomicInteger(0)
      def poolId: String = pool.incrementAndGet().toString

      def runner(i: Int) = Future {
        // the pool name must be a String
        sc.setLocalProperty("spark.scheduler.pool", poolId)
        val data: DataFrame = DataContainer.getDataFrame(i) // get the DataFrame for key i
        val x = new MyClass(data)                           // initialize MyClass with a new object
        x.setSettings(...)
        x.calcSomething()
        x.saveResults()
      }

      val l = List(34, 32, 132, 352)      // Scala List of keys
      val futures = l.map(i => runner(i))

      // now wait for all the futures to complete
      futures.foreach(f => Await.ready(f, Duration.Inf))
    }
    

    With the fair scheduler and a different pool per job, each concurrent job gets a fair share of the Spark cluster's resources.
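
    For reference, a minimal sketch of enabling the fair scheduler when building the session; the config keys are standard Spark settings, while the app name and allocation file path are only placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parallel-jobs")                    // placeholder name
      .config("spark.scheduler.mode", "FAIR")      // enable the fair scheduler
      // optional: pool definitions; the path is only an example
      .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      .getOrCreate()

    val sc = spark.sparkContext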

    See the Scala documentation on Futures for reference. You might need to add callbacks on completion, success, and/or failure.
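
    As a sketch of that callback style (reusing the futures list from the code above), you could combine the futures and react once they have all finished:

    import scala.util.{Failure, Success}

    // combine the individual futures and react once they have all completed
    val all = Future.sequence(futures)
    all.onComplete {
      case Success(_)  => println("all jobs finished")
      case Failure(ex) => println(s"at least one job failed: ${ex.getMessage}")
    }
    Await.ready(all, Duration.Inf)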

  • 2020-12-09 06:25

    I did this using something like List.par.foreach { obj => println(obj) }. I am using Zeppelin on Spark 2.3. I have a similar use case where I need to fetch the data day by day and process each day separately. This cannot be done with a whole month of data because of some join conditions on the tables I'm using. Here is a sample of my code:

    import java.time.LocalDate

    var start = LocalDate.of(2019, 1, 1)
    val end   = LocalDate.of(2019, 2, 1)
    var list: List[LocalDate] = List()

    val usersDf = spark.read.load("s3://production/users/")
    usersDf.createOrReplaceTempView("usersDf")

    // build the list of days to process
    while (start.isBefore(end)) {
        list = start :: list
        start = start.plusDays(1)
    }

    list.par.foreach { loopDate =>
        //println(loopDate)
        val yesterday = loopDate.minusDays(1)
        val tomorrow  = loopDate.plusDays(1)
        val lastDay   = yesterday.getDayOfMonth()
        val lastMonth = yesterday.getMonthValue()
        val lastYear  = yesterday.getYear()

        val day   = loopDate.getDayOfMonth()
        val month = loopDate.getMonthValue()
        val year  = loopDate.getYear()
        val dateDay = loopDate

        val condition: String =
          if (month == lastMonth) {
            s"where year = $year and month = $month and day in ($day, $lastDay)"
          } else {
            s"""where ((year = $year and month = $month and day = $day) or
            (year = $lastYear and month = $lastMonth and day = $lastDay))
            """
          }

        // Get events in the local timezone
        val aggPbDf = spark.sql(s"""
                with users as (
                select * from usersDf
                where account_creation_date < '$tomorrow'
            )
            , cte as (
                select e.*, date(from_utc_timestamp(to_timestamp(concat(e.year, '-', e.month, '-', e.day, ' ', e.hour), 'yyyy-MM-dd HH'), coalesce(u.timezone_name, 'UTC'))) as local_date
                from events.user_events e
                left join users u
                on u.account_id = e.account_id
                $condition)
            select * from cte
            where local_date = '$dateDay'
        """
        )
        aggPbDf.write.mode("overwrite")
            .format("parquet")
            .save(s"s3://prod-bucket/events/local-timezone/date_day=$dateDay")
    }
    

    Each iteration reads the data for two days, processes it, and writes out only the desired output. Running this without par took about 15 minutes per day, but with par the whole month finished in about an hour. This will also depend on what your Spark cluster can support and on the size of your data.
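
    If the default parallelism of par overloads the cluster, one option (a sketch; the batch size of 4 is only an assumed value) is to process the days in bounded batches:

    // run only a few days at a time; tune the batch size to the cluster
    list.grouped(4).foreach { batch =>
      batch.par.foreach { loopDate =>
        // same per-day query and write as above
      }
    }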
