Question
In our Spark Streaming app, with 60-second batches, we register a temp view over a DataFrame and then run about 80 queries against it, like:
sparkSession.sql("select ... from temp_view group by ...")
but given that these are fairly heavy queries, with about 300 summed columns, it would be nice if we didn't have to analyze the SQL and generate a query plan with every micro-batch.
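For context, the per-batch flow looks roughly like this (a minimal sketch only; the socket source, column names and query text are illustrative stand-ins for our actual job):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf  = new SparkConf().setAppName("per-batch-queries")
val ssc   = new StreamingContext(conf, Seconds(60))   // 60-second batches
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._

ssc.socketTextStream("localhost", 9999).foreachRDD { rdd =>
  val df = rdd.toDF("value")                          // DataFrame over this batch's data
  df.createOrReplaceTempView("temp_view")
  // ~80 heavy aggregation queries; each one is parsed, analyzed and planned again here
  spark.sql("select value, count(*) as cnt from temp_view group by value").collect()
}

ssc.start()
ssc.awaitTermination()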
Isn't there a way to generate, cache and reuse a query plan? Even saving just 50 ms per query would save us about 4 s per batch.
We're using Spark 2.2 on CDH/YARN. Thanks.
Answer 1:
I haven't tried this myself, but to "generate, cache and reuse a query plan" you would simply (re)use the query itself (it may not be in the "shape" you usually work with, but there is a form that may work for your case).
(Thinking aloud)
Every structured query (be it a Dataset, DataFrame or SQL) goes through phases, i.e. parsing, analysis, logical optimization, planning and physical optimization.
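Each of those phases leaves a plan behind that you can inspect yourself through Dataset.queryExecution (a quick sketch, where q is any DataFrame):

// Per-phase plans exposed by QueryExecution
q.queryExecution.logical        // parsed logical plan
q.queryExecution.analyzed       // analyzed logical plan
q.queryExecution.optimizedPlan  // logical plan after logical optimization
q.queryExecution.sparkPlan      // physical plan selected by the planner
q.queryExecution.executedPlan   // physical plan after physical preparations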
A structured query is described by its plans, with the optimized physical query plan being the one you can see using Dataset.explain:
explain(): Unit Prints the physical plan to the console for debugging purposes.
scala> spark.version
res0: String = 2.3.1-SNAPSHOT
scala> val q = spark.range(4).select('id, 'id * 2 as "x2")  // a stand-in query for illustration
q: org.apache.spark.sql.DataFrame = [id: bigint, x2: bigint]
scala> :type q
org.apache.spark.sql.DataFrame
scala> q.explain
== Physical Plan ==
*(1) Project [id#0L, (id#0L * 2) AS x2#2L]
+- *(1) Range (0, 4, step=1, splits=8)
You don't work with the plan(s) directly, but the point is that you could. Another important point is that the plan(s) usually know nothing about the data they are optimized for (I say "usually" because Spark SQL has a cost-based optimizer that uses table statistics to produce the most optimized query plan possible).
Whenever you execute an action, the query goes through the so-called structured query execution pipeline, and it does that "preprocessing" every time an action is executed (even if it is the very same action). That's why you could cache the result, but that would tie the query to the data forever (which you want to avoid).
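For illustration only, a sketch of what that caching would look like (the query text is hypothetical); note that the cached Dataset holds on to this batch's data, not just the plan, so it is of no use across micro-batches:

val result = spark.sql("select value, count(*) as cnt from temp_view group by value")
result.cache()   // later actions reuse the plan *and* this batch's data
result.count()   // materializes the cache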
With that said, I think you could do those preparation steps once, before calling an action (and pumping data through the "pipes" of the query). Simply use the optimized physical query plan that you can get via Dataset.rdd, which gives you the RDD that represents your structured query. With that RDD, you could simply run RDD.[theAction] every batch interval and avoid all the stages a structured query goes through to become an RDD.
scala> q.rdd
res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[4] at rdd at <console>:26
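In code, that idea is roughly the following (a sketch; q stands in for one of your 80 queries):

val planned = q.rdd   // pay the parse/analyze/optimize/plan cost once; this is an RDD[Row]
planned.count()       // an RDD action; no query preparation happens on this call
planned.count()       // subsequent actions reuse the very same RDD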
You could even go one step "lower" and use QueryExecution.toRdd instead, which gives you the RDD of Spark SQL's internal rows (no conversion to Row).
scala> q.queryExecution.toRdd
res4: org.apache.spark.rdd.RDD[org.apache.spark.sql.catalyst.InternalRow] = MapPartitionsRDD[7] at toRdd at <console>:26
But (again, thinking aloud) perhaps all this reuse happens automatically anyway, since those stages are lazy vals, so just... no, it would not work... disregard the last "But" and stick to the idea of reusing the underlying RDD :) It should work.
BTW, that's pretty much what Spark Structured Streaming used to do every batch (interval) with micro-batching. That has changed in 2.3 though.
Source: https://stackoverflow.com/questions/49583401/how-to-avoid-query-preparation-parsing-planning-and-optimizations-every-time