Spark 2: how does it work when SparkSession enableHiveSupport() is invoked


Question


My question is rather simple, but somehow I cannot find a clear answer by reading the documentation.

I have Spark2 running on a CDH 5.10 cluster. There is also Hive and a metastore.

I create a session in my Spark program as follows:

SparkSession spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate();

Suppose I have the following HiveQL query:

spark.sql("SELECT someColumn FROM someTable")

I would like to know whether:

  1. under the hood this query is translated into Hive MapReduce primitives, or
  2. the support for HiveQL is only at a syntactical level and Spark SQL will be used under the hood.

I am doing some performance evaluation, and I don't know whether the time performance of queries executed with spark.sql([hiveQL query]) should be attributed to Spark or to Hive.


Answer 1:


Spark knows two catalogs, hive and in-memory. If you set enableHiveSupport(), then spark.sql.catalogImplementation is set to hive, otherwise to in-memory. So if you enable hive support, spark.catalog.listTables().show() will show you all tables from the hive metastore.

But this does not mean that Hive is used for the query*; it just means that Spark communicates with the Hive metastore. The execution engine is always Spark.

*There are actually some functions, like percentile and percentile_approx, which are native Hive UDAFs.
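
For example, you can check which catalog implementation is active and list the metastore tables from within a session. A minimal sketch in Java, matching the session created in the question:

// Prints "hive" when enableHiveSupport() was used, otherwise "in-memory".
System.out.println(spark.conf().get("spark.sql.catalogImplementation"));

// With Hive support enabled, lists the tables known to the Hive metastore.
spark.catalog().listTables().show();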




Answer 2:


Setting enableHiveSupport doesn't mean that the query is computed by Hive.

It's only about the Hive catalog. If you use enableHiveSupport, then you can:

  • read from and write to the persistent Hive metastore
  • use Hive's UDFs
  • use Hive's SerDes
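
As a minimal Java sketch of the first two points (the table name demo_ids is hypothetical), you can persist a table through the Hive metastore and then call the Hive UDAF percentile_approx on it:

// Assumes: import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row;
// Persist a table through the Hive metastore (hypothetical name demo_ids).
Dataset<Row> df = spark.range(100).toDF("id");
df.write().mode("overwrite").saveAsTable("demo_ids");

// Call a Hive UDAF; the query is still planned and executed by Spark.
spark.sql("SELECT percentile_approx(id, 0.5) FROM demo_ids").show();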

All of this is connected to the catalog, not to the execution itself.

Historically, HiveQL parsing was also done by Hive, but Spark now parses queries itself, without calling Hive.

"I don't know whether the time performance of queries executed with spark.sql([hiveQL query]) should be attributed to Spark or to Hive."

As stated above, it is the performance of Spark.
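
One way to see this for yourself is to print the query plan: the physical plan is made of Spark operators (FileScan, HashAggregate, and so on) rather than MapReduce stages. A minimal sketch, assuming someTable from the question exists:

// Prints parsed, analyzed, optimized, and physical plans; all are Spark plans.
spark.sql("SELECT someColumn FROM someTable").explain(true);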




Answer 3:


Hive has three execution engines: MapReduce, Tez, and Spark.

When you execute a query through Hive, you can set one of these engines; your admins will have configured one of them as the default, e.g.:

set hive.execution.engine=tez;

When you execute the query through Spark, it always uses the Spark engine.

However, if you are doing performance analysis, time is not the only thing you should measure; memory and CPU usage should be measured as well.
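
For the time component, a simple wall-clock measurement around a fully materialized query is a reasonable start; a minimal sketch follows (collect() forces the lazy plan to actually execute). Memory and CPU figures can be read from the Spark UI or the YARN Resource Manager instead:

// Wall-clock timing sketch; collect() forces execution of the lazy plan.
long start = System.nanoTime();
spark.sql("SELECT someColumn FROM someTable").collect();
System.out.println("Query took " + (System.nanoTime() - start) / 1_000_000 + " ms");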




Answer 4:


"under the hood this query is translated into Hive MapReduce primitives, or the support for HiveQL is only at a syntactical level and Spark SQL will be used under the hood."

I use Spark SQL on a Hive metastore. The way I verified whether a query is translated into Map/Reduce jobs or not:

  a. Open the Hive console and run a simple SELECT query with some filter. Then go to the YARN Resource Manager: you will see Map/Reduce jobs being fired as a result of the query execution.
  b. Run the same SQL query through Spark SQL using HiveContext. Spark SQL will capitalize on Hive's metastore information without triggering any Map/Reduce jobs. Check the YARN Resource Manager again: you will only find the spark-shell session running, and no additional Map/Reduce job is fired on the cluster.



Source: https://stackoverflow.com/questions/52169175/spark-2-how-does-it-work-when-sparksession-enablehivesupport-is-invoked
