1 Oozie Overview
Oozie is an open-source framework built around a workflow engine that schedules and coordinates Hadoop MapReduce and Pig jobs. It is mainly used to run jobs on a schedule, and multiple jobs can be chained according to their logical execution order.
2 Functional Modules
2.1 Modules
1. Workflow
Executes flow nodes in sequence; supports fork (split into multiple parallel paths) and join (merge multiple paths back into one).
2. Coordinator
Triggers workflows on a schedule.
3. Bundle
Bundles multiple Coordinators together; a minimal sketch follows this list.
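The examples in section 4 cover Workflow and Coordinator but never Bundle, so here is a minimal hedged sketch of a bundle definition; the bundle name, coordinator name, and application path are illustrative assumptions, not from the original:
<!-- hypothetical bundle: groups one or more coordinator apps under a single handle -->
<bundle-app name="demo-bundle" xmlns="uri:oozie:bundle:0.2">
    <coordinator name="cron-coord-1">
        <!-- HDFS path of a coordinator app directory (assumed; mirrors the cron app in section 4.4) -->
        <app-path>hdfs://hadoop102:8020/user/djm/oozie-apps/cron</app-path>
    </coordinator>
</bundle-app>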
2.2 Common Node Types
- Control Flow Nodes
Control flow nodes are usually defined at the start or end of a workflow (start, end, kill) or provide path-control mechanisms within it (decision, fork, join); a minimal decision sketch follows this list.
- Action Nodes
Nodes that perform concrete work, such as copying files or running a shell script.
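The multi-node example in section 4.2 demonstrates fork and join, but decision never appears; below is a minimal hedged sketch of one. The node names, the runMode property, and both branch targets are illustrative assumptions:
<!-- hypothetical decision node: routes on a job property named runMode -->
<decision name="check-mode">
    <switch>
        <!-- take the full-load branch when runMode is set to "full" -->
        <case to="full-load">${wf:conf('runMode') eq 'full'}</case>
        <!-- otherwise fall through to the incremental branch -->
        <default to="incremental-load"/>
    </switch>
</decision>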
3 Installation and Deployment
3.1 Hadoop
Configure core-site.xml
<configuration>
<!-- NameNode address for HDFS -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop102:8020</value>
</property>
<!-- storage directory for files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp</value>
</property>
</configuration>
Configure hadoop-env.sh
# point JAVA_HOME at the installed JDK
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure hdfs-site.xml
<configuration>
<!-- HDFS replication factor -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- SecondaryNameNode host -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop104:50090</value>
</property>
</configuration>
Configure yarn-env.sh
# point JAVA_HOME at the installed JDK
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure yarn-site.xml
<configuration>
<!-- shuffle service through which reducers fetch map output -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- ResourceManager host -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop103</value>
</property>
<!-- enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- retain aggregated logs for 7 days (604800 seconds) -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
Configure mapred-env.sh
# point JAVA_HOME at the installed JDK
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure mapred-site.xml
<configuration>
<!-- run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- JobHistory server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop104:10020</value>
</property>
<!-- JobHistory server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop104:19888</value>
</property>
</configuration>
Configure slaves
hadoop102
hadoop103
hadoop104
Distribute the finished Hadoop configuration across the cluster
[djm@hadoop102 ~]$ xsync /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/etc/hadoop/
Start the cluster (the JobHistory server runs on hadoop104, matching mapred-site.xml)
[djm@hadoop102 hadoop-2.5.0-cdh5.3.6]$ sbin/start-dfs.sh
[djm@hadoop103 hadoop-2.5.0-cdh5.3.6]$ sbin/start-yarn.sh
[djm@hadoop104 hadoop-2.5.0-cdh5.3.6]$ sbin/mr-jobhistory-daemon.sh start historyserver
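As a sanity check, running jps on each node should show the daemons implied by the configuration above (all three hosts appear in slaves, so each runs a DataNode and a NodeManager):
[djm@hadoop102 ~]$ jps    # NameNode, DataNode, NodeManager
[djm@hadoop103 ~]$ jps    # ResourceManager, DataNode, NodeManager
[djm@hadoop104 ~]$ jps    # SecondaryNameNode, JobHistoryServer, DataNode, NodeManager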
3.2 Oozie
Extract Oozie
[djm@hadoop102 software]$ tar -zxvf /opt/software/cdh/oozie-4.0.0-cdh5.3.6.tar.gz -C /opt/module
In the Oozie root directory, extract oozie-hadooplibs-4.0.0-cdh5.3.6.tar.gz
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ tar -zxvf oozie-hadooplibs-4.0.0-cdh5.3.6.tar.gz -C ../
Create a libext directory under the Oozie root
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ mkdir libext/
Copy the dependency jars into libext
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -ra hadooplibs/hadooplib-2.5.0-cdh5.3.6.oozie-4.0.0-cdh5.3.6/* libext/
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -a /opt/software/mysql-connector-java-5.1.27-bin.jar ./libext/
Copy ext-2.2.zip (the ExtJS library used by the web console) into libext
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -a /opt/software/cdh/ext-2.2.zip libext/
Edit oozie-site.xml
<configuration>
<property>
<name>oozie.service.JPAService.jdbc.driver</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.url</name>
<value>jdbc:mysql://hadoop102:3306/oozie</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.username</name>
<value>root</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.password</name>
<value>123456</value>
</property>
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/opt/module/cdh/hadoop-2.5.0-cdh5.3.6/etc/hadoop</value>
</property>
</configuration>
Initialize Oozie
# In MySQL, create the oozie database:
create database oozie;
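If the root account cannot connect to MySQL from the Oozie host, the JPAService settings above will fail at startup; a hedged grant matching the credentials in oozie-site.xml (MySQL 5.x syntax; adjust the host pattern to your security policy):
-- assumption: opens oozie.* to root from any host, password taken from oozie-site.xml above
GRANT ALL PRIVILEGES ON oozie.* TO 'root'@'%' IDENTIFIED BY '123456';
FLUSH PRIVILEGES;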
# Upload the Oozie sharelib (the yarn tarball in the Oozie directory) to HDFS:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie-setup.sh sharelib create -fs hdfs://hadoop102:8020 -locallib oozie-sharelib-4.0.0-cdh5.3.6-yarn.tar.gz
# Create the Oozie database schema (generates oozie.sql and runs it):
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/ooziedb.sh create -sqlfile oozie.sql -run
# Build the Oozie web application (generates the war):
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie-setup.sh prepare-war
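To confirm the sharelib landed on HDFS (by default it is placed under /user/<submitting user>/share/lib):
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -ls /user/djm/share/lib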
Starting and stopping Oozie
Start:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozied.sh start
Stop:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozied.sh stop
Access the web UI at http://hadoop102:11000/oozie (the console requires the ext-2.2 library installed above)
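A quick health check from the CLI; the server should report System mode: NORMAL when it is up:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie admin -oozie http://hadoop102:11000/oozie -status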
4 Hands-On Examples
4.1 Single-Node Workflow
1. Create a working directory
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ mkdir -p oozie-apps/shell
2. In oozie-apps/shell, create workflow.xml and job.properties
[djm@hadoop102 shell]$ touch workflow.xml
[djm@hadoop102 shell]$ touch job.properties
3. Edit workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
<!-- start node -->
<start to="shell-node"/>
<!-- action node -->
<action name="shell-node">
<!-- shell action -->
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<!-- the command to execute -->
<exec>mkdir</exec>
<argument>/opt/module/d</argument>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<!-- kill node -->
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<!-- end node -->
<end name="end"/>
</workflow-app>
4. Edit job.properties
# HDFS address
nameNode=hdfs://hadoop102:8020
# ResourceManager address
jobTracker=hadoop103:8032
# queue name
queueName=default
examplesRoot=oozie-apps
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/shell
5. Upload the application directory to HDFS (run from the Oozie root, where oozie-apps/ lives)
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -put oozie-apps/ /user/djm
6. Run the job
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -config oozie-apps/shell/job.properties -run
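Submission prints a workflow job ID, which can then be polled for status; the ID below is a placeholder:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -info <job-id>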
4.2 Multi-Node Workflow
1. Edit workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
<start to="p1-shell-node"/>
<action name="p1-shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mkdir</exec>
<argument>/opt/module/d1</argument>
<capture-output/>
</shell>
<ok to="forking"/>
<error to="fail"/>
</action>
<fork name="forking">
<path start="p2-shell-node" />
<path start="p3-shell-node" />
</fork>
<action name="p2-shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mkdir</exec>
<argument>/opt/module/d2</argument>
<capture-output/>
</shell>
<ok to="joining"/>
<error to="fail"/>
</action>
<action name="p3-shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mkdir</exec>
<argument>/opt/module/d3</argument>
<capture-output/>
</shell>
<ok to="joining"/>
<error to="fail"/>
</action>
<join name="joining" to="p4-shell-node"/>
<action name="p4-shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mkdir</exec>
<argument>/opt/module/d4</argument>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
2. Edit job.properties
nameNode=hdfs://hadoop102:8020
jobTracker=hadoop103:8032
queueName=default
examplesRoot=oozie-apps
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/shell
3. Remove the previously uploaded application from HDFS
[djm@hadoop102 shell]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -rm -r -f /user/djm/oozie-apps/
4. Upload the application directory
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -put oozie-apps/ /user/djm
5. Run the job
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -config oozie-apps/shell/job.properties -run
4.3 Scheduling a MapReduce Job with Oozie
1. Copy the bundled template into oozie-apps
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -r /opt/module/cdh/oozie-4.0.0-cdh5.3.6/examples/apps/map-reduce/ oozie-apps/
2. Edit workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
<start to="mr-node"/>
<action name="mr-node">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/output/"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<!-- use the new MapReduce API -->
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<!-- job output key class -->
<property>
<name>mapreduce.job.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<!-- job output value class -->
<property>
<name>mapreduce.job.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<!-- input path -->
<property>
<name>mapred.input.dir</name>
<value>/input/</value>
</property>
<!-- output path -->
<property>
<name>mapred.output.dir</name>
<value>/output/</value>
</property>
<!-- mapper class -->
<property>
<name>mapreduce.job.map.class</name>
<value>org.apache.hadoop.examples.WordCount$TokenizerMapper</value>
</property>
<!-- reducer class -->
<property>
<name>mapreduce.job.reduce.class</name>
<value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
3. Edit job.properties
nameNode=hdfs://hadoop102:8020
jobTracker=hadoop103:8032
queueName=default
examplesRoot=oozie-apps
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/map-reduce/workflow.xml
4. Copy the jar to be executed into map-reduce/lib
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -a /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar oozie-apps/map-reduce/lib
5. Remove the previously uploaded application from HDFS
[djm@hadoop102 shell]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -rm -r -f /user/djm/oozie-apps/
6. Upload the application directory
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -put oozie-apps/ /user/djm
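The workflow reads its input from /input/ (mapred.input.dir), which nothing above creates; a hedged preparation step, with wc.input standing in for any local text file:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -mkdir -p /input
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -put wc.input /input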
7. Run the job
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -config oozie-apps/map-reduce/job.properties -run
4.4 Scheduled Jobs
1. Check whether the ntp service is installed
[root@hadoop102 ~]# rpm -qa | grep ntp
2. Edit /etc/ntp.conf
Change
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
to (opening access to the cluster subnet)
restrict 192.168.10.0 mask 255.255.255.0 nomodify notrap
Comment out the public pool servers, changing
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst
to
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
Add (so hadoop102 serves time from its local clock even when offline)
server 127.127.1.0
fudge 127.127.1.0 stratum 10
3. Edit /etc/sysconfig/ntpd
# also sync the hardware clock
SYNC_HWCLOCK=yes
4. Restart the ntpd service
[root@hadoop102 ~]# systemctl restart ntpd
5. Enable ntpd at boot
[root@hadoop102 ~]# chkconfig ntpd on
6. On the other machines, add a cron entry that syncs with the time server every 10 minutes
[root@hadoop103 ~]# crontab -e
Add
*/10 * * * * /usr/sbin/ntpdate hadoop102
7. Edit oozie-site.xml (set Oozie's processing timezone to GMT+0800)
<property>
<name>oozie.processing.timezone</name>
<value>GMT+0800</value>
</property>
8. Restart Oozie
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozied.sh stop
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozied.sh start
9. Copy the bundled cron template into oozie-apps
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -r /opt/module/cdh/oozie-4.0.0-cdh5.3.6/examples/apps/cron/ oozie-apps/
10. Edit workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="one-op-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>p1.sh</exec>
<file>/user/djm/oozie-apps/cron/p1.sh</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
11. Edit coordinator.xml (note that by default Oozie rejects coordinator frequencies under 5 minutes)
<coordinator-app name="cron-coord" frequency="${coord:minutes(5)}" start="${start}" end="${end}" timezone="GMT+0800" xmlns="uri:oozie:coordinator:0.2">
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>jobTracker</name>
<value>${jobTracker}</value>
</property>
<property>
<name>nameNode</name>
<value>${nameNode}</value>
</property>
<property>
<name>queueName</name>
<value>${queueName}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
12. Edit job.properties
nameNode=hdfs://hadoop102:8020
jobTracker=hadoop103:8032
queueName=default
examplesRoot=oozie-apps
oozie.coord.application.path=${nameNode}/user/${user.name}/${examplesRoot}/cron
start=2019-09-26T17:00+0800
end=2019-09-30T17:00+0800
workflowAppUri=${nameNode}/user/${user.name}/${examplesRoot}/cron
13. Create p1.sh
[djm@hadoop102 cron]$ vim p1.sh
#!/bin/bash
date >> /opt/module/p1.log
14. Remove the previously uploaded application from HDFS
[djm@hadoop102 shell]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -rm -r -f /user/djm/oozie-apps/
15. Upload the application directory
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -put oozie-apps/ /user/djm
16. Run the job
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -config oozie-apps/cron/job.properties -run
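Once submitted, the coordinator materializes a workflow run every 5 minutes until the end time; it can be inspected or stopped with the commands below (the job ID is a placeholder for the one printed at submission):
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -info <coord-job-id>
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -kill <coord-job-id>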