Question
I am trying to connect to my hive server from a local copy of Airflow, but it seems like the HiveCliHook is trying to connect to my local copy of Hive.
I'm running the following to test it:
import airflow
from airflow.models import Connection
from airflow.hooks.hive_hooks import HiveCliHook
usr = 'myusername'
pss = 'mypass'
session = airflow.settings.Session()
hive_cli = session.query(Connection).filter(Connection.conn_id == 'hive_cli_default').all()[0]
hive_cli.host = 'hive_server.test.mydomain.com'
hive_cli.port = '9083'
hive_cli.login = usr
hive_cli.password = pss
hive_cli.schema = 'default'
session.commit()
hive = HiveCliHook()
hive.run_cli("select 1")
Which is throwing this error:
[2018-11-28 13:23:22,667] {base_hook.py:83} INFO - Using connection to: hive_server.test.mydomain.com
[2018-11-28 13:24:50,891] {hive_hooks.py:220} INFO - hive -f /tmp/airflow_hiveop_2Fdl2I/tmpBFoGp7
[2018-11-28 13:24:55,548] {hive_hooks.py:235} INFO - Logging initialized using configuration in jar:file:/usr/local/apache-hive-2.3.4-bin/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
[2018-11-28 13:25:01,776] {hive_hooks.py:235} INFO - FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Does anyone have any idea where I'm going wrong?
Answer 1:
While you can use the HiveCliOperator (unaltered) to connect to and run HQL statements on a remote Hive server, the one requirement is that the box running your Airflow worker must also have the Hive binaries installed. This is because the hive CLI command prepared by HiveCliHook is executed on the worker machine via good old bash. If the Hive CLI is not installed on the machine where this code runs (i.e. your Airflow worker), it will break, as in your case.
The straightforward workaround is to implement your own RemoteHiveCliOperator that
- creates an SSHHook to the remote Hive-server machine
- and executes your HQL statement through that SSHHook
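A minimal sketch of that second step, assuming a paramiko-style client such as the one returned by Airflow's SSHHook.get_conn(); the helper names build_hive_command and run_hql_over_ssh are illustrative, not part of the Airflow API:

```python
import shlex

def build_hive_command(hql):
    # Quote the HQL so it reaches the remote hive CLI as a single,
    # unmangled argument to `hive -e`.
    return 'hive -e {}'.format(shlex.quote(hql))

def run_hql_over_ssh(ssh_client, hql):
    """Run an HQL statement on the remote Hive box.

    `ssh_client` is any paramiko-style client exposing exec_command(),
    e.g. what Airflow's SSHHook.get_conn() returns (an assumption here).
    """
    stdin, stdout, stderr = ssh_client.exec_command(build_hive_command(hql))
    # Block until the remote command finishes, then surface failures.
    if stdout.channel.recv_exit_status() != 0:
        raise RuntimeError(stderr.read().decode())
    return stdout.read().decode()
```

Inside your custom operator's execute(), you would open the client (for example with SSHHook(ssh_conn_id='my_hive_ssh').get_conn()) and hand it to run_hql_over_ssh together with the HQL.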
In fact, this seems to be a universal drawback of almost all Airflow Operators: by default, they expect the requisite packages to be installed on every worker. The docs warn about it:
For example, if you use the HiveOperator, the hive CLI needs to be installed on that box
Source: https://stackoverflow.com/questions/53528673/airflow-hiveclihook-connection-to-remote-hive-cluster