Cache and Query a Dataset In Parallel Using Spark

Asked by 北荒 on 2021-01-03 09:07

I have a requirement where I want to cache a dataset and then compute some metrics by firing "N" queries in parallel over that dataset, with all of these queries computing against the cached data.

1 Answer
  • 2021-01-03 09:25

    It can be very simple to fire parallel queries in Spark driver code using Scala's parallel collections. Here is a minimal example of how this could look:

    import org.apache.spark.sql.DataFrame
    import spark.implicits._ // for .toDF; assumes a SparkSession named `spark`
    
    // cache the source dataset so each query reads from memory
    val dfSrc = Seq(("Raphael", 34)).toDF("name", "age").cache()
    
    // define your queries; instead of returning a DataFrame you could also write to a table etc.
    val query1: DataFrame => DataFrame = _.select("name")
    val query2: DataFrame => DataFrame = _.select("age")
    
    // fire the queries in parallel (Scala 2.12; in 2.13 parallel collections
    // live in the separate scala-parallel-collections module)
    import scala.collection.parallel.ParSeq
    ParSeq(query1, query2).foreach(query => query(dfSrc).show())
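
    Parallel collections run on a shared fork-join pool. If more control over concurrency is needed, the same pattern can be expressed with plain Scala Futures instead; here is a minimal sketch under that assumption (the timeout value is an arbitrary choice):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    
    // run each query asynchronously against the cached dataset
    val futures = Seq(query1, query2).map(q => Future(q(dfSrc).show()))
    
    // block the driver until all queries have completed
    Await.result(Future.sequence(futures), 10.minutes)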
    

    EDIT:

    To collect the query ID and result in a map, you can do the following:

    // map each query ID to its resulting DataFrame
    val resultMap = ParSeq(
      (1, query1),
      (2, query2)
    ).map { case (queryId, query) => (queryId, query(dfSrc)) }.toMap
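
    The `toMap` at the end yields a parallel map (ParMap), since it is called on a parallel collection; a query's result can then be fetched by ID, for example (a small usage sketch):

    // look up and display the result of query 1
    resultMap(1).show()
    // convert to a regular sequential Map if needed
    val plainResults = resultMap.seq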
    