Cache and Query a Dataset In Parallel Using Spark

Asked by 北荒 on 2021-01-03 09:07

I have a requirement where I want to cache a dataset and then compute some metrics by firing "N" queries in parallel over that dataset, with all of these queries computing against the cached data.

1 Answer
  • 2021-01-03 09:25

    It can be very simple to fire parallel queries in Spark driver code using Scala's parallel collections. Here is a minimal example of how this could look:

    import org.apache.spark.sql.DataFrame
    import spark.implicits._ // for .toDF; assumes a SparkSession named `spark`
    
    // cache the source dataset so each query reads from memory
    val dfSrc = Seq(("Raphael", 34)).toDF("name", "age").cache()
    
    // define your queries; instead of returning a DataFrame you could also write to a table etc.
    val query1: DataFrame => DataFrame = _.select("name")
    val query2: DataFrame => DataFrame = _.select("age")
    
    // fire the queries in parallel (Scala 2.12; in 2.13 parallel collections
    // live in the separate scala-parallel-collections module)
    import scala.collection.parallel.ParSeq
    ParSeq(query1, query2).foreach(query => query(dfSrc).show())
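
    Parallel collections run on a shared fork-join pool. If more control over concurrency is needed, the same pattern can be expressed with plain Scala Futures instead; here is a minimal sketch under that assumption (the timeout value is an arbitrary choice):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    
    // run each query asynchronously against the cached dataset
    val futures = Seq(query1, query2).map(q => Future(q(dfSrc).show()))
    
    // block the driver until all queries have completed
    Await.result(Future.sequence(futures), 10.minutes)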
    

    EDIT:

    To collect the query ID and result in a map, you can do the following:

    // map each query ID to its resulting DataFrame
    val resultMap = ParSeq(
      (1, query1),
      (2, query2)
    ).map { case (queryId, query) => (queryId, query(dfSrc)) }.toMap
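
    The `toMap` at the end yields a parallel map (ParMap), since it is called on a parallel collection; a query's result can then be fetched by ID, for example (a small usage sketch):

    // look up and display the result of query 1
    resultMap(1).show()
    // convert to a regular sequential Map if needed
    val plainResults = resultMap.seq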
    