Airflow HiveCliHook connection to remote hive cluster?

Posted by 点点圈 on 2019-12-29 09:10:35

Question


I am trying to connect to my hive server from a local copy of Airflow, but it seems like the HiveCliHook is trying to connect to my local copy of Hive.

I'm running the following to test it:

import airflow
from airflow.models import Connection
from airflow.hooks.hive_hooks import HiveCliHook

usr = 'myusername'
pss = 'mypass'

session = airflow.settings.Session()
hive_cli = session.query(Connection).filter(Connection.conn_id == 'hive_cli_default').all()[0]

hive_cli.host = 'hive_server.test.mydomain.com'
hive_cli.port = '9083'
hive_cli.login = usr
hive_cli.password = pss
hive_cli.schema = 'default'

session.commit()

hive = HiveCliHook()

hive.run_cli("select 1")

Which is throwing this error:

[2018-11-28 13:23:22,667] {base_hook.py:83} INFO - Using connection to: hive_server.test.mydomain.com
[2018-11-28 13:24:50,891] {hive_hooks.py:220} INFO - hive -f /tmp/airflow_hiveop_2Fdl2I/tmpBFoGp7  
[2018-11-28 13:24:55,548] {hive_hooks.py:235} INFO - Logging initialized using configuration in jar:file:/usr/local/apache-hive-2.3.4-bin/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true  
[2018-11-28 13:25:01,776] {hive_hooks.py:235} INFO - FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Does anyone have any idea where I'm going wrong?


Answer 1:


  • You can use the HiveCliHook (or the HiveOperator built on it) unaltered to connect to a remote Hive server and execute HQL statements. The only requirement is that the machine running your Airflow worker must also have the Hive binaries installed.

  • This is because the hive CLI command prepared by HiveCliHook is run on the worker machine as an ordinary local shell command. If the Hive CLI is not installed on the machine where this code runs (i.e. your Airflow worker), it will break, as in your case.
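
As a quick sanity check of the point above, you could test from Python whether a hive executable is visible on the worker's PATH before calling the hook. This is only a sketch: the helper name hive_cli_available is made up for illustration and is not part of Airflow.

```python
import shutil


def hive_cli_available(binary="hive"):
    """Return True if `binary` can be found on this machine's PATH.

    Hypothetical helper for illustration; not an Airflow API.
    """
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if hive_cli_available():
        print("hive found on PATH")
    else:
        print("hive not found on PATH; a local Hive CLI call is expected to fail here")
```

Note that finding the binary only shows the CLI is installed. It does not show that the local Hive installation is configured to reach the remote cluster.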


A straightforward workaround is to implement your own RemoteHiveCliOperator that

  • Creates an SSHHook to the remote Hive server machine
  • Executes your HQL statement through that SSHHook
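
The two steps above could be sketched roughly as follows. This is a minimal illustration under assumptions, not a finished operator. The function names are invented here, and ssh_client is assumed to behave like a paramiko SSHClient (which, as far as I know, is what Airflow's SSHHook.get_conn() returns). It also assumes the remote machine has the hive CLI on its PATH and uses the CLI's -e option to pass the statement.

```python
import shlex


def build_remote_hive_command(hql, hive_bin="hive"):
    """Build a shell command that passes `hql` to the Hive CLI via -e.

    shlex.quote keeps the statement as a single shell argument.
    """
    return "{} -e {}".format(hive_bin, shlex.quote(hql))


def run_hql_over_ssh(ssh_client, hql):
    """Run `hql` on the remote host and return its stdout as text.

    `ssh_client` is assumed to offer paramiko's
    exec_command(command) -> (stdin, stdout, stderr) interface.
    """
    command = build_remote_hive_command(hql)
    _stdin, stdout, stderr = ssh_client.exec_command(command)

    # Drain both streams first, then ask for the exit status.
    output = stdout.read().decode("utf-8")
    errors = stderr.read().decode("utf-8")
    exit_status = stdout.channel.recv_exit_status()

    if exit_status != 0:
        raise RuntimeError(
            "remote hive exited with {}: {}".format(exit_status, errors)
        )
    return output
```

A custom operator's execute() method could obtain the client from an SSHHook and call run_hql_over_ssh(client, "select 1"). A real implementation would also need to cope with very large stderr output, timeouts, and closing the connection, none of which this sketch handles.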

In fact, this seems to be a common drawback of almost all Airflow operators: by default they expect the requisite packages to be installed on every worker. The docs warn about it:

For example, if you use the HiveOperator, the hive CLI needs to be installed on that box



Source: https://stackoverflow.com/questions/53528673/airflow-hiveclihook-connection-to-remote-hive-cluster
