Spark 2: how does it work when SparkSession enableHiveSupport() is invoked


Question


My question is rather simple, but somehow I cannot find a clear answer by reading the documentation.

I have Spark2 running on a CDH 5.10 cluster. There is also Hive and a metastore.

I create a session in my Spark program as follows:

SparkSession spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate();

Suppose I have the following HiveQL query:

spark.sql("SELECT someColumn FROM someTable")

I would like to know whether:

  1. under the hood this query is translated into Hive MapReduce primitives, or
  2. the support for HiveQL is only at a syntactical level and Spark SQL will be used under the hood.

I am doing some performance evaluation, and I don't know whether the time performance of queries executed with spark.sql([hiveQL query]) should be attributed to Spark or to Hive.


Answer 1:


Spark knows two catalogs, hive and in-memory. If you set enableHiveSupport(), then spark.sql.catalogImplementation is set to hive, otherwise to in-memory. So if you enable hive support, spark.catalog.listTables().show() will show you all tables from the hive metastore.

But this does not mean that Hive is used for the query*; it just means that Spark communicates with the Hive metastore. The execution engine is always Spark.

*There are actually some functions, like percentile and percentile_approx, which are native Hive UDAFs.
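
For example, you can check which catalog implementation is active and list the metastore tables from within a session. A minimal sketch in Java, matching the session created in the question:

// Prints "hive" when enableHiveSupport() was used, otherwise "in-memory".
System.out.println(spark.conf().get("spark.sql.catalogImplementation"));

// With Hive support enabled, lists the tables known to the Hive metastore.
spark.catalog().listTables().show();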




Answer 2:


Setting enableHiveSupport doesn't mean that the query is computed by Hive.

It's only about the Hive catalog. If you use enableHiveSupport, then you can:

  • read from and write to the persistent Hive metastore
  • use Hive's UDFs
  • use Hive's SerDes
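
As a minimal Java sketch of the first two points (the table name demo_ids is hypothetical), you can persist a table through the Hive metastore and then call the Hive UDAF percentile_approx on it:

// Assumes: import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row;
// Persist a table through the Hive metastore (hypothetical name demo_ids).
Dataset<Row> df = spark.range(100).toDF("id");
df.write().mode("overwrite").saveAsTable("demo_ids");

// Call a Hive UDAF; the query is still planned and executed by Spark.
spark.sql("SELECT percentile_approx(id, 0.5) FROM demo_ids").show();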

All of this is connected to the catalog, not to the execution itself.

Historically, HiveQL parsing was also done by Hive, but Spark now parses queries itself, without calling Hive.

"I don't know whether the time performance of queries executed with spark.sql([hiveQL query]) should be attributed to Spark or to Hive."

As stated above, it is the performance of Spark.
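
One way to see this for yourself is to print the query plan: the physical plan is made of Spark operators (FileScan, HashAggregate, and so on) rather than MapReduce stages. A minimal sketch, assuming someTable from the question exists:

// Prints parsed, analyzed, optimized, and physical plans; all are Spark plans.
spark.sql("SELECT someColumn FROM someTable").explain(true);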




Answer 3:


Hive has three execution engines: MapReduce, Tez, and Spark.

When you execute a query through Hive, you can set one of these engines; your admins will have configured one of them as the default, e.g.:

set hive.execution.engine=tez;

When you execute the query through Spark, it always uses the Spark engine.

However, if you are doing performance analysis, time is not the only thing you should measure; memory and CPU usage should be measured as well.
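
For the time component, a simple wall-clock measurement around a fully materialized query is a reasonable start; a minimal sketch follows (collect() forces the lazy plan to actually execute). Memory and CPU figures can be read from the Spark UI or the YARN Resource Manager instead:

// Wall-clock timing sketch; collect() forces execution of the lazy plan.
long start = System.nanoTime();
spark.sql("SELECT someColumn FROM someTable").collect();
System.out.println("Query took " + (System.nanoTime() - start) / 1_000_000 + " ms");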




Answer 4:


"under the hood this query is translated into Hive MapReduce primitives, or the support for HiveQL is only at a syntactical level and Spark SQL will be used under the hood."

I use Spark SQL on a Hive metastore. The way I verified whether a query is translated into Map/Reduce jobs or not:

  a. Open the Hive console and run a simple SELECT query with some filter. Then go to the YARN Resource Manager: you will see Map/Reduce jobs being fired as a result of the query execution.
  b. Run the same SQL query through Spark SQL using HiveContext. Spark SQL will capitalize on Hive's metastore information without triggering any Map/Reduce jobs. Check the YARN Resource Manager again: you will only find the spark-shell session running, and no additional Map/Reduce job is fired on the cluster.



Source: https://stackoverflow.com/questions/52169175/spark-2-how-does-it-work-when-sparksession-enablehivesupport-is-invoked
