I have a requirement where I want to cache a dataset and then compute some metrics by firing "N" queries in parallel over that dataset, with all of these queries computing over the same cached data.
It is very simple to fire parallel queries from Spark's driver code using Scala's parallel collections. Here is a minimal example of how this could look:
import org.apache.spark.sql.DataFrame
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

val dfSrc = Seq(("Raphael", 34)).toDF("name", "age").cache()

// define your queries; instead of returning a DataFrame you could also write to a table etc.
val query1: DataFrame => DataFrame = df => df.select("name")
val query2: DataFrame => DataFrame = df => df.select("age")
// fire the queries in parallel
import scala.collection.parallel.ParSeq // on Scala 2.13+ this requires the scala-parallel-collections module

ParSeq(query1, query2).foreach(query => query(dfSrc).show())
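By default a parallel collection runs on the global fork-join pool, so with many queries you may want to bound the parallelism explicitly. A minimal sketch, assuming you want at most 4 queries in flight (the pool size of 4 is an illustrative value you would tune yourself):

import scala.collection.parallel.ForkJoinTaskSupport
import java.util.concurrent.ForkJoinPool

val queries = ParSeq(query1, query2)
// cap the parallel collection at 4 worker threads (hypothetical cap, tune as needed)
queries.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
queries.foreach(query => query(dfSrc).show())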
EDIT:
To collect the query ID and result in a map you can do:
val resultMap = ParSeq(
  (1, query1),
  (2, query2)
).map { case (queryId, query) => (queryId, query(dfSrc)) }
  .toMap
  .seq // .toMap on a parallel collection yields a parallel map; .seq converts it back to a regular Map
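Note that DataFrame transformations are lazy, so the map above only holds unevaluated DataFrames; the actual computation runs when you call an action on each result. As a usage sketch:

// look up the result of query 1 by its ID and trigger execution
resultMap(1).show()

// or iterate over all collected results
resultMap.foreach { case (queryId, df) =>
  println(s"result of query $queryId:")
  df.show()
}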