1 Oozie Overview
Oozie is an open-source framework built around a workflow engine that schedules and coordinates Hadoop MapReduce and Pig jobs. It is mainly used to run jobs on a schedule, and multiple jobs can be chained according to their logical execution order.
2 Functional Modules
2.1 Modules
1. Workflow
Executes flow nodes in sequence; supports fork (split into multiple parallel paths) and join (merge multiple paths back into one).
2. Coordinator
Triggers workflows on a schedule.
3. Bundle
Bundles multiple Coordinators together; a minimal sketch follows this list.
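The examples in section 4 cover Workflow and Coordinator but never Bundle, so here is a minimal hedged sketch of a bundle definition; the bundle name, coordinator name, and application path are illustrative assumptions, not from the original:
<!-- hypothetical bundle: groups one or more coordinator apps under a single handle -->
<bundle-app name="demo-bundle" xmlns="uri:oozie:bundle:0.2">
    <coordinator name="cron-coord-1">
        <!-- HDFS path of a coordinator app directory (assumed; mirrors the cron app in section 4.4) -->
        <app-path>hdfs://hadoop102:8020/user/djm/oozie-apps/cron</app-path>
    </coordinator>
</bundle-app>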
2.2 Common Node Types
- Control Flow Nodes
Control flow nodes are usually defined at the start or end of a workflow (start, end, kill) or provide path-control mechanisms within it (decision, fork, join); a minimal decision sketch follows this list.
- Action Nodes
Nodes that perform concrete work, such as copying files or running a shell script.
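The multi-node example in section 4.2 demonstrates fork and join, but decision never appears; below is a minimal hedged sketch of one. The node names, the runMode property, and both branch targets are illustrative assumptions:
<!-- hypothetical decision node: routes on a job property named runMode -->
<decision name="check-mode">
    <switch>
        <!-- take the full-load branch when runMode is set to "full" -->
        <case to="full-load">${wf:conf('runMode') eq 'full'}</case>
        <!-- otherwise fall through to the incremental branch -->
        <default to="incremental-load"/>
    </switch>
</decision>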
3 Installation and Deployment
3.1 Hadoop
Configure core-site.xml
<configuration>
<!-- NameNode address for HDFS -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop102:8020</value>
</property>
<!-- storage directory for files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp</value>
</property>
</configuration>
Configure hadoop-env.sh
# point JAVA_HOME at the installed JDK
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure hdfs-site.xml
<configuration>
<!-- HDFS replication factor -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- SecondaryNameNode host -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop104:50090</value>
</property>
</configuration>
Configure yarn-env.sh
# point JAVA_HOME at the installed JDK
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure yarn-site.xml
<configuration>
<!-- shuffle service through which reducers fetch map output -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- ResourceManager host -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop103</value>
</property>
<!-- enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- retain aggregated logs for 7 days (604800 seconds) -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
Configure mapred-env.sh
# point JAVA_HOME at the installed JDK
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure mapred-site.xml
<configuration>
<!-- run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- JobHistory server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop104:10020</value>
</property>
<!-- JobHistory server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop104:19888</value>
</property>
</configuration>
Configure slaves
hadoop102
hadoop103
hadoop104
Distribute the finished Hadoop configuration across the cluster
[djm@hadoop102 ~]$ xsync /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/etc/hadoop/
Start the cluster (the JobHistory server runs on hadoop104, matching mapred-site.xml)
[djm@hadoop102 hadoop-2.5.0-cdh5.3.6]$ sbin/start-dfs.sh
[djm@hadoop103 hadoop-2.5.0-cdh5.3.6]$ sbin/start-yarn.sh
[djm@hadoop104 hadoop-2.5.0-cdh5.3.6]$ sbin/mr-jobhistory-daemon.sh start historyserver
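As a sanity check, running jps on each node should show the daemons implied by the configuration above (all three hosts appear in slaves, so each runs a DataNode and a NodeManager):
[djm@hadoop102 ~]$ jps    # NameNode, DataNode, NodeManager
[djm@hadoop103 ~]$ jps    # ResourceManager, DataNode, NodeManager
[djm@hadoop104 ~]$ jps    # SecondaryNameNode, JobHistoryServer, DataNode, NodeManager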
3.2 Oozie
Extract Oozie
[djm@hadoop102 software]$ tar -zxvf /opt/software/cdh/oozie-4.0.0-cdh5.3.6.tar.gz -C /opt/module
In the Oozie root directory, extract oozie-hadooplibs-4.0.0-cdh5.3.6.tar.gz
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ tar -zxvf oozie-hadooplibs-4.0.0-cdh5.3.6.tar.gz -C ../
Create a libext directory under the Oozie root
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ mkdir libext/
Copy the dependency jars into libext
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -ra hadooplibs/hadooplib-2.5.0-cdh5.3.6.oozie-4.0.0-cdh5.3.6/* libext/
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -a /opt/software/mysql-connector-java-5.1.27-bin.jar ./libext/
Copy ext-2.2.zip (the ExtJS library used by the web console) into libext
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -a /opt/software/cdh/ext-2.2.zip libext/
Edit oozie-site.xml
<configuration>
<property>
<name>oozie.service.JPAService.jdbc.driver</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.url</name>
<value>jdbc:mysql://hadoop102:3306/oozie</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.username</name>
<value>root</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.password</name>
<value>123456</value>
</property>
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/opt/module/cdh/hadoop-2.5.0-cdh5.3.6/etc/hadoop</value>
</property>
</configuration>
Initialize Oozie
# In MySQL, create the oozie database:
create database oozie;
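If the root account cannot connect to MySQL from the Oozie host, the JPAService settings above will fail at startup; a hedged grant matching the credentials in oozie-site.xml (MySQL 5.x syntax; adjust the host pattern to your security policy):
-- assumption: opens oozie.* to root from any host, password taken from oozie-site.xml above
GRANT ALL PRIVILEGES ON oozie.* TO 'root'@'%' IDENTIFIED BY '123456';
FLUSH PRIVILEGES;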
# Upload the Oozie sharelib (the yarn tarball in the Oozie directory) to HDFS:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie-setup.sh sharelib create -fs hdfs://hadoop102:8020 -locallib oozie-sharelib-4.0.0-cdh5.3.6-yarn.tar.gz
# Create the Oozie database schema (generates oozie.sql and runs it):
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/ooziedb.sh create -sqlfile oozie.sql -run
# Build the Oozie web application (generates the war):
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie-setup.sh prepare-war
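To confirm the sharelib landed on HDFS (by default it is placed under /user/<submitting user>/share/lib):
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -ls /user/djm/share/lib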
Starting and stopping Oozie
Start:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozied.sh start
Stop:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozied.sh stop
Access the web UI at http://hadoop102:11000/oozie (the console requires the ext-2.2 library installed above)
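A quick health check from the CLI; the server should report System mode: NORMAL when it is up:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie admin -oozie http://hadoop102:11000/oozie -status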
4 Hands-On Examples
4.1 Single-Node Workflow
1. Create a working directory
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ mkdir -p oozie-apps/shell
2. In oozie-apps/shell, create workflow.xml and job.properties
[djm@hadoop102 shell]$ touch workflow.xml
[djm@hadoop102 shell]$ touch job.properties
3. Edit workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
<!-- start node -->
<start to="shell-node"/>
<!-- action node -->
<action name="shell-node">
<!-- shell action -->
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<!-- the command to execute -->
<exec>mkdir</exec>
<argument>/opt/module/d</argument>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<!-- kill node -->
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<!-- end node -->
<end name="end"/>
</workflow-app>
4. Edit job.properties
# HDFS address
nameNode=hdfs://hadoop102:8020
# ResourceManager address
jobTracker=hadoop103:8032
# queue name
queueName=default
examplesRoot=oozie-apps
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/shell
5. Upload the application directory to HDFS (run from the Oozie root, where oozie-apps/ lives)
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -put oozie-apps/ /user/djm
6. Run the job
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -config oozie-apps/shell/job.properties -run
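Submission prints a workflow job ID, which can then be polled for status; the ID below is a placeholder:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -info <job-id>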
4.2 Multi-Node Workflow
1. Edit workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
<start to="p1-shell-node"/>
<action name="p1-shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mkdir</exec>
<argument>/opt/module/d1</argument>
<capture-output/>
</shell>
<ok to="forking"/>
<error to="fail"/>
</action>
<fork name="forking">
<path start="p2-shell-node" />
<path start="p3-shell-node" />
</fork>
<action name="p2-shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mkdir</exec>
<argument>/opt/module/d2</argument>
<capture-output/>
</shell>
<ok to="joining"/>
<error to="fail"/>
</action>
<action name="p3-shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mkdir</exec>
<argument>/opt/module/d3</argument>
<capture-output/>
</shell>
<ok to="joining"/>
<error to="fail"/>
</action>
<join name="joining" to="p4-shell-node"/>
<action name="p4-shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mkdir</exec>
<argument>/opt/module/d4</argument>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
2. Edit job.properties
nameNode=hdfs://hadoop102:8020
jobTracker=hadoop103:8032
queueName=default
examplesRoot=oozie-apps
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/shell
3. Remove the previously uploaded application from HDFS
[djm@hadoop102 shell]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -rm -r -f /user/djm/oozie-apps/
4. Upload the application directory
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -put oozie-apps/ /user/djm
5. Run the job
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -config oozie-apps/shell/job.properties -run
4.3 Scheduling a MapReduce Job with Oozie
1. Copy the bundled template into oozie-apps
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -r /opt/module/cdh/oozie-4.0.0-cdh5.3.6/examples/apps/map-reduce/ oozie-apps/
2. Edit workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
<start to="mr-node"/>
<action name="mr-node">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/output/"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<!-- use the new MapReduce API -->
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<!-- job output key class -->
<property>
<name>mapreduce.job.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<!-- job output value class -->
<property>
<name>mapreduce.job.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<!-- input path -->
<property>
<name>mapred.input.dir</name>
<value>/input/</value>
</property>
<!-- output path -->
<property>
<name>mapred.output.dir</name>
<value>/output/</value>
</property>
<!-- mapper class -->
<property>
<name>mapreduce.job.map.class</name>
<value>org.apache.hadoop.examples.WordCount$TokenizerMapper</value>
</property>
<!-- reducer class -->
<property>
<name>mapreduce.job.reduce.class</name>
<value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
3. Edit job.properties
nameNode=hdfs://hadoop102:8020
jobTracker=hadoop103:8032
queueName=default
examplesRoot=oozie-apps
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/map-reduce/workflow.xml
4. Copy the jar to be executed into map-reduce/lib
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -a /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar oozie-apps/map-reduce/lib
5. Remove the previously uploaded application from HDFS
[djm@hadoop102 shell]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -rm -r -f /user/djm/oozie-apps/
6. Upload the application directory
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -put oozie-apps/ /user/djm
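The workflow reads its input from /input/ (mapred.input.dir), which nothing above creates; a hedged preparation step, with wc.input standing in for any local text file:
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -mkdir -p /input
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -put wc.input /input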
7. Run the job
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -config oozie-apps/map-reduce/job.properties -run
4.4 Scheduled Jobs
1. Check whether the ntp service is installed
[root@hadoop102 ~]# rpm -qa | grep ntp
2. Edit /etc/ntp.conf
Change
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
to (opening access to the cluster subnet)
restrict 192.168.10.0 mask 255.255.255.0 nomodify notrap
Comment out the public pool servers, changing
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst
to
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
Add (so hadoop102 serves time from its local clock even when offline)
server 127.127.1.0
fudge 127.127.1.0 stratum 10
3. Edit /etc/sysconfig/ntpd
# also sync the hardware clock
SYNC_HWCLOCK=yes
4. Restart the ntpd service
[root@hadoop102 ~]# systemctl restart ntpd
5. Enable ntpd at boot
[root@hadoop102 ~]# chkconfig ntpd on
6. On the other machines, add a cron entry that syncs with the time server every 10 minutes
[root@hadoop103 ~]# crontab -e
Add
*/10 * * * * /usr/sbin/ntpdate hadoop102
7. Edit oozie-site.xml (set Oozie's processing timezone to GMT+0800)
<property>
<name>oozie.processing.timezone</name>
<value>GMT+0800</value>
</property>
8. Restart Oozie
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozied.sh stop
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozied.sh start
9. Copy the bundled cron template into oozie-apps
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ cp -r /opt/module/cdh/oozie-4.0.0-cdh5.3.6/examples/apps/cron/ oozie-apps/
10. Edit workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="one-op-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>p1.sh</exec>
<file>/user/djm/oozie-apps/cron/p1.sh</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
11. Edit coordinator.xml (note that by default Oozie rejects coordinator frequencies under 5 minutes)
<coordinator-app name="cron-coord" frequency="${coord:minutes(5)}" start="${start}" end="${end}" timezone="GMT+0800" xmlns="uri:oozie:coordinator:0.2">
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>jobTracker</name>
<value>${jobTracker}</value>
</property>
<property>
<name>nameNode</name>
<value>${nameNode}</value>
</property>
<property>
<name>queueName</name>
<value>${queueName}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
12. Edit job.properties
nameNode=hdfs://hadoop102:8020
jobTracker=hadoop103:8032
queueName=default
examplesRoot=oozie-apps
oozie.coord.application.path=${nameNode}/user/${user.name}/${examplesRoot}/cron
start=2019-09-26T17:00+0800
end=2019-09-30T17:00+0800
workflowAppUri=${nameNode}/user/${user.name}/${examplesRoot}/cron
13. Create p1.sh
[djm@hadoop102 cron]$ vim p1.sh
#!/bin/bash
date >> /opt/module/p1.log
14. Remove the previously uploaded application from HDFS
[djm@hadoop102 shell]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -rm -r -f /user/djm/oozie-apps/
15. Upload the application directory
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/bin/hadoop fs -put oozie-apps/ /user/djm
16. Run the job
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -config oozie-apps/cron/job.properties -run
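Once submitted, the coordinator materializes a workflow run every 5 minutes until the end time; it can be inspected or stopped with the commands below (the job ID is a placeholder for the one printed at submission):
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -info <coord-job-id>
[djm@hadoop102 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://hadoop102:11000/oozie -kill <coord-job-id>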