Oozie

Workflow (工作流)

Submitted by 吃可爱长大的小学妹 on 2019-12-16 07:56:40
Workflow: a workflow is "the automation of a business process, in whole or in part, in a computer application environment." It is an abstract, generalized description of a work process and of the business rules between its operational steps. The main problem a workflow solves is this: to achieve a business goal, software automatically passes documents, information, or tasks among multiple participants according to predefined rules.
A complete data analysis system is usually composed of several modules with upstream/downstream dependencies: data collection, data preprocessing, data analysis, data presentation, and so on. The modules have temporal dependencies on one another and repeat periodically. To organize such a complex execution plan properly, a workflow scheduling system is needed to drive execution.
How workflow scheduling is implemented:
Simple task scheduling: define jobs directly with Linux crontab. The obvious drawback is that dependencies between jobs cannot be expressed (see the crontab sketch below).
Complex task scheduling: build an in-house scheduling platform, or use an open-source scheduler such as Azkaban, Apache Oozie, Cascading, or Hamake. The best known of these is Apache Oozie, but configuring its workflows means writing a large amount of XML, the configuration is fairly complex, and it is not easy to extend.
Comparison of workflow scheduling tools: the table below compares the key features of four Hadoop workflow schedulers. Although they address essentially the same requirements, they differ noticeably in design philosophy, target users, and application scenarios, which can serve as a reference when making a technology choice.
Feature | Hamake | Oozie | Azkaban | Cascading
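To make the crontab limitation concrete, here is a minimal sketch; the script paths, times, and log locations are made up for illustration and are not from the original text. Cron can only stagger start times; it has no way to say "run preprocessing only after ingestion succeeded," which is exactly the dependency handling a workflow scheduler adds.

    # crontab -e  (illustrative entries)
    # Run ingestion at 01:00 and preprocessing at 02:00 every day.
    # Nothing here expresses "B depends on A": if ingest.sh runs late or fails,
    # preprocess.sh still starts at 02:00.
    0 1 * * * /opt/jobs/ingest.sh     >> /var/log/jobs/ingest.log     2>&1
    0 2 * * * /opt/jobs/preprocess.sh >> /var/log/jobs/preprocess.log 2>&1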

Scheduling spark jobs on a timely basis

Submitted by 爷,独闯天下 on 2019-12-14 03:43:05
Question: Which is the recommended tool for scheduling Spark jobs on a daily/weekly basis? 1) Oozie 2) Luigi 3) Azkaban 4) Chronos 5) Airflow. Thanks in advance.
Answer 1: Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines. Airflow: try this first. Decent UI, Python-ish job definition, semi-accessible for non-programmers; the dependency declaration syntax is weird. Airflow has built-in support for the fact that scheduled jobs often need to be

Add Spark to Oozie shared lib

Submitted by ☆樱花仙子☆ on 2019-12-14 00:33:39
Question: By default, the Oozie shared lib directory provides libraries for Hive, Pig, and MapReduce. If I want to run a Spark job on Oozie, it might be better to add the Spark lib jars to Oozie's shared lib instead of copying them to the app's lib directory. How can I add the Spark lib jars (including spark-core and its dependencies) to Oozie's shared lib? Any comment / answer is appreciated.
Answer 1: A Spark action is scheduled to be released with Oozie 4.2.0, even though the documentation seems to be a bit behind. See the related JIRA
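A hedged sketch of how jars are usually added to the HDFS-based sharelib used by Oozie 4.x: copy them into a spark/ sub-directory of the active sharelib and tell the server to reload it. The hostname, the lib_<timestamp> directory name, and the local jar path below are placeholders, not values from the question or the answer.

    # Inspect the sharelib the server currently uses (Oozie URL is a placeholder).
    oozie admin -oozie http://oozie-host:11000/oozie -shareliblist

    # Copy spark-core and its dependencies into a "spark" sub-directory of the
    # active sharelib (the lib_<timestamp> directory name is an assumption).
    hdfs dfs -mkdir -p /user/oozie/share/lib/lib_<timestamp>/spark
    hdfs dfs -put /opt/spark/lib/*.jar /user/oozie/share/lib/lib_<timestamp>/spark/

    # Ask the running Oozie server to pick up the new directory, then verify.
    oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate
    oozie admin -oozie http://oozie-host:11000/oozie -shareliblist spark

Workflows can then point their Spark actions at that directory via the oozie.action.sharelib.for.spark=spark property, instead of shipping the jars in each application's lib/ folder.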

Oozie job submission fails

Submitted by 非 Y 不嫁゛ on 2019-12-13 20:31:31
Question: I am trying to submit an example map-reduce Oozie job, and all the properties are configured properly with regard to the path, name node, job-tracker port, etc. I validated the workflow.xml too. When I deploy the job I get a job id, but when I check the status I see KILLED, and the details basically say that /var/tmp/oozie/oozie-oozi7188507762062318929.dir/map-reduce-launcher.jar does not exist.
Answer 1: In order to resolve this error, just create the hdfs folders and give appropriate
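The answer above is cut off, but the fix it points at is creating the missing directories with suitable permissions. A minimal sketch, assuming (as the answer says) the folders live on HDFS and taking the path from the error message; the owner and mode are illustrative guesses, not values from the answer.

    # Create the temp directory the Oozie launcher complains about and open it up.
    hdfs dfs -mkdir -p /var/tmp/oozie
    hdfs dfs -chown -R oozie:oozie /var/tmp/oozie
    hdfs dfs -chmod -R 775 /var/tmp/oozie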

Oozie Workflow EL function timestamp() does not give seconds

Submitted by 北慕城南 on 2019-12-13 19:24:14
Question: I have the following Oozie workflow:
    <workflow-app name="${workflow_name}" xmlns="uri:oozie:workflow:0.4">
      <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
          <property>
            <name>mapred.job.queue.name</name>
            <value>${launcherQueueName}</value>
          </property>
          <property>
            <name>mapred.queue.name</name>
            <value>${launcherQueueName}</value>
          </property>
        </configuration>
      </global>
      <start to="email-1" />
      <action name="email-1">
        <email xmlns="uri:oozie:email
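The workflow above is cut off before the email body, but a common way around the title's complaint (the timestamp() EL function not resolving down to seconds) is to generate a seconds-precision timestamp outside Oozie and pass it in as a job property at submission time. A hedged sketch; the Oozie URL and the submitTs property name are invented for illustration.

    # Build an ISO-8601 timestamp with seconds on the submitting host.
    SUBMIT_TS=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

    # Hand it to the workflow as a property; inside workflow.xml it can then
    # be referenced as ${submitTs} wherever seconds are needed.
    oozie job -oozie http://oozie-host:11000/oozie \
              -config job.properties \
              -D submitTs="$SUBMIT_TS" \
              -run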

AirFlow dag id access in sub-tasks

Submitted by 我们两清 on 2019-12-13 14:21:00
Question: I have a DAG with three bash tasks which is scheduled to run every day. I would like to access a unique ID of the DAG instance (maybe the PID) in all bash scripts. Is there any way to do this? I am looking for functionality similar to Oozie, where we can access WORKFLOW_ID in the workflow XML or Java code. Can somebody point me to Airflow documentation on "How to use built-in and custom variables in an Airflow DAG"? Many thanks, Pari
Answer 1: An object's attributes can be accessed with dot notation in jinja2 (see
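A minimal sketch of what that looks like from the bash side, assuming the lines below are passed as the bash_command of an Airflow BashOperator so that Jinja renders the template variables before the script runs; the variable names are standard Airflow template fields rather than anything taken from the question.

    # Rendered by Airflow's Jinja templating before execution.
    echo "dag id:  {{ dag.dag_id }}"
    echo "task id: {{ task.task_id }}"
    echo "run id:  {{ run_id }}"
    echo "ts:      {{ ts }}"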

Integrating Oozie with Hue

Submitted by 六月ゝ 毕业季﹏ on 2019-12-13 10:15:36
5.1 Modify the Hue configuration file hue.ini
    [liboozie]
    # The URL where the Oozie service runs on. This is required in order for
    # users to submit jobs. Empty value disables the config check.
    oozie_url=http://node-1:11000/oozie

    # Requires FQDN in oozie_url if enabled
    ## security_enabled=false

    # Location on HDFS where the workflows/coordinator are deployed when submitted.
    remote_deployement_dir=/user/root/oozie_works

    [oozie]
    # Location on local FS where the examples are stored.
    # local_data_dir=/export/servers/oozie-4.1.0-cdh5.14.0/examples/apps

    # Location on local FS where the data for the examples is stored.
    #

sqoop exec job in oozie is not working

Submitted by 南楼画角 on 2019-12-13 07:42:03
Question: I am running a 3-node HDP 2.2 cluster. The Oozie version is 4.1.0.2.2 and the Sqoop version is 1.4.5.2.2. I am using a Sqoop job to do incremental imports from an RDBMS into HDFS, as shown below:
    sqoop job --create JOB1 \
      --meta-connect "jdbc:hsqldb:hsql://ip-address:16000/sqoop" \
      -- import \
      --connect jdbc:oracle:thin:@ip-address:db \
      --username db_user \
      --password-file hdfs://ip-address:8020/user/oozie/.password_sqoop \
      --table TABLE1 \
      --target-dir /user/incremental/ \
      --incremental lastmodified \
      --check-column LAST_UPDATED
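For context, the saved job created above is executed against the same metastore with sqoop job --exec, which is roughly what the Oozie sqoop or shell action in the title would need to invoke; a short sketch reusing the question's placeholder address.

    # Run the saved incremental-import job from the shared Sqoop metastore.
    sqoop job --exec JOB1 \
      --meta-connect "jdbc:hsqldb:hsql://ip-address:16000/sqoop"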

how to debug failed oozie workflows on Analytics for Apache Hadoop?

Submitted by 血红的双手。 on 2019-12-13 07:17:38
Question: I'm trying to run an Oozie workflow on Bluemix Analytics for Apache Hadoop, but it is failing. The output from calling Workflow status is as follows:
    ...
    {
      "errorMessage": "Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]",
      "status": "ERROR",
      "stats": null,
      "data": null,
      "transition": "fail",
      "externalStatus": "FAILED/KILLED",
      "cred": "null",
      "conf": "<shell xmlns=\"uri:oozie:shell-action:0.2\"> <job-tracker>****:8050</job-tracker> <name-node>hdfs://****:8020</name-node>
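Exit code 1 from ShellMain only says that the wrapped shell script failed; the real cause usually sits in the launcher and YARN container logs. A hedged sketch of the usual places to look; the Oozie URL, workflow job ID, and YARN application ID below are placeholders, not values from the question.

    # Oozie's view of the workflow and the failing action's launcher log.
    oozie job -oozie http://oozie-host:11000/oozie -info 0000001-XXXXXXXXXXXXXXX-oozie-oozi-W
    oozie job -oozie http://oozie-host:11000/oozie -log  0000001-XXXXXXXXXXXXXXX-oozie-oozi-W

    # Full stdout/stderr of the shell script, from the YARN container that ran it.
    yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX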

How to create Oozie workflow dependencies in hue --workflow--Editor

Submitted by 拥有回忆 on 2019-12-13 07:00:14
Question: CDH 5.5.2 (hue --workflow--Editor)
We are sqooping data from different ERP systems, so we created several independent Oozie workflows using the editor (hue --workflow--Editor), e.g. ERP1_workflow1, ERP2_workflow2, ERP3_workflow3, ERP4_workflow ... ERP7. Each workflow has 40 to 50 sqoop commands; we created separate workflows so they can run in parallel, and they run daily at a particular time. Now we have one more workflow, "final_workflow", which has a few hive