Scheduling an ad-hoc query with Hive/Hadoop using Oozie

Posted by こ雲淡風輕ζ on 2019-12-24 20:36:04

Question


Does Oozie support a user scheduling, via a REST API, an ad-hoc Hive query?

We're building a system where a user can search documents in Hadoop, with support for the user (optionally) specifying some attributes of the data to be searched, using Hive to perform the query against Hadoop. Because of this support for optional fields, we don't know ahead of time what the Hive query will look like (in terms of which tables will be used in the Hive query). We have a service where, at run-time, we process the user's query to generate the corresponding Hive query.

We'd like to be able to schedule these queries via Oozie, but I haven't been able to find documentation on how to perform this via Oozie. I assume this is possible. Is there sample Java code available to describe how to perform this operation?


Answer 1:


Use the Oozie Coordinator to schedule jobs; the Apache Oozie documentation covers the Coordinator and includes an example. Azkaban is another option worth looking at for scheduling.




Answer 2:


Proxy Hive Job Submission via the REST API allows users to submit jobs without creating a workflow XML on HDFS:

  • https://oozie.apache.org/docs/5.1.0/WebServicesAPI.html#Proxy_Hive_Job_Submission
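Since the question asks for Java, here is a minimal sketch of what such a submission could look like using only the JDK (Java 11+). It builds the `<configuration>` payload and the `POST` request for the `jobs?jobtype=hive` endpoint. The property names are the ones I recall from the linked page and should be checked against your Oozie version; the host names, user name, library path and the sample query are placeholders, and the class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Oozie: builds a proxy Hive submission request.
final class HiveProxySubmission {

    /** Escapes XML special characters so generated HiveQL is safe inside <value>. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Renders the properties as a Hadoop-style <configuration> document. */
    static String toConfigurationXml(Map<String, String> props) {
        StringBuilder xml = new StringBuilder(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            xml.append("  <property><name>").append(escapeXml(e.getKey()))
               .append("</name><value>").append(escapeXml(e.getValue()))
               .append("</value></property>\n");
        }
        return xml.append("</configuration>\n").toString();
    }

    /** Builds (but does not send) the POST request for a proxy Hive job. */
    static HttpRequest buildRequest(String oozieBaseUrl, String configXml) {
        return HttpRequest.newBuilder(URI.create(oozieBaseUrl + "/v1/jobs?jobtype=hive"))
                .header("Content-Type", "application/xml;charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(configXml))
                .build();
    }

    public static void main(String[] args) {
        // Assumed property names and placeholder values; verify against the docs.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.default.name", "hdfs://localhost:8020");
        props.put("mapred.job.tracker", "localhost:8021");
        props.put("user.name", "searchsvc");
        // The Hive query generated at run time goes here as inline script text.
        props.put("oozie.hive.script", "SELECT id FROM docs WHERE year > 2010;");
        props.put("oozie.libpath", "hdfs://localhost:8020/user/searchsvc/hive-lib");
        props.put("oozie.proxysubmission", "true");

        HttpRequest request =
                buildRequest("http://localhost:11000/oozie", toConfigurationXml(props));
        System.out.println(request.method() + " " + request.uri());
        // To submit for real, pass the request to
        // java.net.http.HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).
    }
}
```

Escaping the script matters here because run-time generated HiveQL will often contain `<`, `>` or quotes, which would otherwise break the XML payload.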

You can also use the Fluent Job API to programmatically build workflows:

  • https://oozie.apache.org/docs/5.1.0/DG_FluentJobAPI.html#A_More_Verbose_Example
  • https://github.com/apache/oozie/blob/master/fluent-job/fluent-job-api/src/test/java/org/apache/oozie/fluentjob/api/action/TestHive2ActionBuilder.java

As mentioned in the other answer, Oozie Coordinator can be used to schedule and regularly execute workflows. Besides time-based triggers, you can also define data dependencies (such as the existence of specific files on HDFS) that must be satisfied before a workflow starts.
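To make the Coordinator option a bit more concrete: a coordinator application is described by a `coordinator.xml` stored on HDFS, which points at a workflow application and says how often to run it. The following is only a rough sketch, in Java, of rendering a minimal time-triggered definition for a workflow your service has generated. The class name, method name, application name, dates and HDFS path are all hypothetical, and the schema version `0.4` is an assumption to check against your Oozie release. Data dependencies (datasets and input events) are left out for brevity.

```java
// Hypothetical helper, not part of Oozie: renders a minimal coordinator.xml.
final class CoordinatorXmlSketch {

    /**
     * Renders a coordinator that runs the workflow at workflowAppPath every
     * `everyMinutes` minutes between startUtc and endUtc (e.g. "2019-12-24T00:00Z").
     * Inputs are inserted as-is, so they must not contain XML special characters.
     */
    static String render(String name, String startUtc, String endUtc,
                         int everyMinutes, String workflowAppPath) {
        return String.join("\n",
            "<coordinator-app name=\"" + name + "\"",
            "                 frequency=\"${coord:minutes(" + everyMinutes + ")}\"",
            "                 start=\"" + startUtc + "\" end=\"" + endUtc + "\" timezone=\"UTC\"",
            "                 xmlns=\"uri:oozie:coordinator:0.4\">",
            "  <action>",
            "    <workflow>",
            "      <app-path>" + workflowAppPath + "</app-path>",
            "    </workflow>",
            "  </action>",
            "</coordinator-app>");
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.println(render("adhoc-hive-search",
                "2019-12-24T00:00Z", "2019-12-31T00:00Z", 60,
                "hdfs://localhost:8020/user/searchsvc/apps/adhoc-hive"));
    }
}
```

The rendered file would be uploaded to HDFS, and the coordinator job submitted with the `oozie.coord.application.path` property pointing at its directory.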



Source: https://stackoverflow.com/questions/23275414/scheduling-an-ad-hoc-query-with-hive-hadoop-using-oozie
