Oozie

Oozie Workflow Scheduling

Submitted by Anonymous (unverified) on 2019-12-02 23:32:01
Setting up Oozie workflow scheduling: workflow, coordinator, bundle. Contents of workflow.xml and job.properties:
nameNode=hdfs://hadoop01:9000 -- client connection to the HDFS cluster
jobTracker=hadoop01:8032 -- client connection to the YARN cluster
queueName=default -- scheduling queue
filePath=/gp1819/oozie -- Oozie root directory
oozie.libpath=${nameNode}/gp1819/oozielib -- path to third-party dependencies
oozie.wf.application.path=${nameNode}${filePath}/sqoop/ -- directory of the workflow application
Coordinator.xml
hdfs dfs -mkdir /gp1919
hdfs dfs -mkdir -p /gp1919/oozie /gp1919/oozielib
hdfs dfs -put $HIVE_HOME/lib/mysql-connector-java-5.1.32.jar /gp1919/oozielib/
1. Create the sqoop job: vi gp1919_sqoop_desc.sh
2. Check whether the Oozie job configuration is correct
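A minimal sketch of what those two final steps can look like on the command line; the MySQL host, database, table and password-file path are placeholders, not values from the post:

# 1. Create a saved sqoop job (the real contents of gp1919_sqoop_desc.sh are not shown in the post).
sqoop job --create gp1919_sqoop_job -- \
  import --connect jdbc:mysql://hadoop01:3306/testdb \
  --username root --password-file /gp1919/oozielib/pass.txt \
  --table orders --target-dir /gp1919/data/orders -m 1

# 2. Validate the workflow definition and dry-run it against job.properties.
oozie validate workflow.xml
oozie job -oozie http://hadoop01:11000/oozie -config job.properties -dryrun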

Observing duplicates using sqoop with Oozie

Submitted by 时光怂恿深爱的人放手 on 2019-12-02 17:45:55
Question: I've built a sqoop program in order to import data from MySQL to HDFS, using a pre-built sqoop job:
sqoop job -fs $driver_path -D mapreduce.map.java.opts=" -Duser.timezone=Europe/Paris" \
  --create job_parquet_table -- import -m $nodes_number \
  --connect jdbc:mysql://$server:$port/$database --username $username --password-file $pass_file \
  --target-dir $destination_dir --table $table --as-parquetfile --append \
  --incremental append --check-column $id_column_names --last-value 1 \
  --fields-terminated
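For context, a saved job like this is normally executed and inspected as below (the job name comes from the excerpt, everything else is assumed). One frequently reported source of duplicates when such a job runs under Oozie is that the saved job's state lives in a node-local Sqoop metastore, so a run launched on a different worker does not see an updated incremental.last.value:

# Execute the saved job (this is what the Oozie action ends up doing).
sqoop job --exec job_parquet_table

# Inspect the persisted state; incremental.last.value should move forward after
# every successful run. If it stays at 1, later runs re-import the same rows.
sqoop job --show job_parquet_table | grep incremental.last.value

# A shared metastore (started with `sqoop metastore` and addressed via
# --meta-connect) is one way to keep that value consistent across nodes.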

Some notes from using Oozie (continuously updated)

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-02 14:55:48
0. About Oozie: where it looks for jars. At runtime, Oozie only searches two places for the libraries it needs.
1. The lib directory under the HDFS directory of the workflow being submitted. E.g. the fork-merge workflow at /user/root/examples/apps/fork-merge contains job.properties, lib and workflow.xml; Oozie looks for the required jars in that lib directory.
2. When the job is submitted from the shell, Oozie will also look for the jars it needs in its shared library, e.g. /user/root/share/lib/lib_20150128185329. The share lib holds the packages needed by common Oozie actions such as hive, hive2, pig, sqoop, oozie, hcatalog and distcp. If the job is submitted from a Java client, oozie.libpath needs to be set (that path can then hold the extra jars your project needs without putting them into the share lib, which avoids mixing them up):
properties.setProperty("oozie.use.system.libpath","true"); --> use the Oozie share lib
properties.setProperty("oozie.libpath","hdfs://master:9000/user/hdfs/examples/thirdlib"); -
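A short sketch of the two lookup locations described above, reusing the example paths from the post (the jar name is a placeholder):

# 1. Per-workflow jars: drop them into the workflow's lib/ directory on HDFS.
hdfs dfs -put my-extra.jar /user/root/examples/apps/fork-merge/lib/

# 2. Shared jars: list what the share lib currently provides and where it lives.
oozie admin -oozie http://master:11000/oozie -shareliblist
hdfs dfs -ls /user/root/share/lib/lib_20150128185329

# The Java properties shown above have a job.properties equivalent:
# oozie.use.system.libpath=true
# oozie.libpath=hdfs://master:9000/user/hdfs/examples/thirdlib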

Apache Oozie, the Elephant Driver

Submitted by 半世苍凉 on 2019-12-02 14:55:08
(1) What is Apache Oozie? In English, "oozie" means an elephant driver or mahout (a term mostly used in Burma), which is a fairly apt metaphor for what the system does. Apache Oozie is a workflow scheduling system for managing Hadoop jobs, built on a directed acyclic graph (DAG) model. Oozie can compose most kinds of Hadoop jobs; common ones are Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop and DistCp, and it can also pull in scripts such as Shell, Python or Java to handle things flexibly. It is also a scalable, extensible and highly reliable system. (2) What can Apache Oozie be used for? The figure above really already answers this question: a workflow, as the name suggests, means that to get something done you need many steps combined in order so that the goal is eventually reached. Take cooking a meal as an example: 1. buy the groceries, 2. wash the vegetables, 3. chop them, 4. cook them, 5. serve the dish. That is a simple flow, and of course it hides many small details, such as visiting different markets while shopping, or going out mid-cooking to buy some seasoning. Looking closely, some of these steps depend on each other and some do not: the vegetables are the core, so every step involving them has a fixed order, while auxiliary steps such as boiling water have no dependency on them. Many real-world tasks look just like this, which is why managing and scheduling them with Oozie is very convenient. (3

DAG (directed acyclic graph) dynamic job scheduler

Submitted by 无人久伴 on 2019-12-02 14:08:46
I need to manage a large workflow of ETL tasks, whose execution depends on time, data availability or an external event. Some jobs may fail during execution of the workflow, and the system should be able to restart a failed workflow branch without waiting for the whole workflow to finish execution. Are there any frameworks in Python that can handle this? I see several core functions: DAG building; execution of nodes (run shell commands with wait, logging, etc.); the ability to rebuild a sub-graph in the parent DAG during execution; the ability to manually execute nodes or a sub-graph while the parent graph is running

Getting Started with Hue

Submitted by  ̄綄美尐妖づ on 2019-12-02 10:31:41
1 Introduction. What is Hue? Hue = Hadoop User Experience. Put plainly, it is an open-source UI system for Apache Hadoop, implemented on top of the Python web framework Django; through Hue we can interact with a Hadoop cluster from a web console in the browser to analyze and process data. 2 Installation and deployment. 2.1 Documentation: http://archive.cloudera.com/cdh5/cdh/5/hue-3.7.0-cdh5.3.0/manual.html 2.2 Installing Hue. 1. Preparation. Required software environment: CentOS 7.6 + Python 2.7.5 + JDK8 + Maven-3.3.9 + Ant-1.8.1 + Hue-3.7.0. Required cluster environment: Hadoop + HBase + Hive + ZK + MySQL + Oozie. Configure the environment variables:
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
#MAVEN_HOME
export MAVEN_HOME=/opt/module/maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin
#HADOOP_HOME
export

Oozie suppress logging from shell job action?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-02 07:32:05
I have a simple workflow (see below) which runs a shell script. The shell script runs a pyspark script, which moves a file from local storage to an HDFS folder. When I run the shell script itself, it works perfectly; logs are redirected to a file by > spark.txt 2>&1 right in the shell script. But when I submit the Oozie job with the following workflow, the output from the shell seems to be suppressed. I tried to redirect all possible Oozie logs (-verbose -log) > oozie.txt 2>&1, but it didn't help. The workflow finishes successfully (status SUCCEEDED, no error log), but I see the folder is not copied to HDFS, however
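A hedged sketch of how output is usually surfaced from inside an Oozie shell action: anything written to stdout lands in the launcher's YARN container log, while files written locally vanish with the container unless they are pushed to HDFS. The script name, how it is launched (spark-submit here), and the target folder are all assumptions:

#!/bin/bash
# Run the pyspark script, keep its output locally, and also echo it to stdout
# so it appears in the shell action's launcher container log
# (retrievable with `yarn logs -applicationId <launcherAppId>`).
spark-submit move_to_hdfs.py > spark.txt 2>&1
status=$?
cat spark.txt

# Persist the log and propagate the exit code; without the explicit exit,
# Oozie can report SUCCEEDED even when the copy silently failed.
hdfs dfs -put -f spark.txt /user/$(whoami)/logs/spark.txt
exit $status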

Submitting Flink jobs through an Oozie shell action, with Kerberos authentication

Submitted by 房东的猫 on 2019-12-02 03:03:00
Lately I have been busy migrating to a new cluster and moved to the latest CDH 6.3.0, and submitting Flink jobs ran into quite a few problems. Fortunately, with a Cloudera license we had help from the vendor as well as from the community, so the problems were solved much faster. The concrete setup is CDH 6.3.0 + Flink 1.8.1, and every component of the data platform is behind Kerberos and LDAP. Because everything has to pass authentication, we chose Oozie as the single, unified way to submit jobs. And because of Kerberos, each Flink per-job submission also needs its own keytab so that permissions can be controlled at a fine granularity, since compute resources between departments are currently divided by YARN resource queues. Flink's support for this is not great yet: only one keytab can be configured in the configuration file, and every job pulls that keytab and copies it into its own container at startup. Still, we want the primary way of submitting Flink jobs to be through Oozie. Since Oozie has no native support for Flink submission, the only option is the Oozie shell action. Once Flink was set up we started submitting jobs through an Oozie shell action:
#!/bin/bash
flink run -m yarn-cluster flinktest.jar
and immediately, bang: flink command not found. After switching to the command's absolute path, still, bang: org.apache.flink.client.deployment
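A minimal sketch of the kind of wrapper script the post is describing, assuming hypothetical paths, keytab and principal names (/opt/flink-1.8.1, etl_user.keytab, etl_user@EXAMPLE.COM); the excerpt is cut off before the org.apache.flink.client.deployment error is resolved, so this only covers the "command not found" and Kerberos pieces:

#!/bin/bash
# Oozie shell actions run on an arbitrary NodeManager with a minimal PATH,
# so refer to the Flink client by an absolute path (path is an assumption).
FLINK_HOME=/opt/flink-1.8.1

# Hadoop client configuration for the submission (assumed location on CDH).
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Authenticate with the per-job keytab shipped alongside the action via its
# <file> element (keytab file name and principal are placeholders).
kinit -kt etl_user.keytab etl_user@EXAMPLE.COM

# Submit the job to YARN in per-job mode, as in the post.
$FLINK_HOME/bin/flink run -m yarn-cluster flinktest.jar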

Distcp - Container is running beyond physical memory limits

Submitted by 只谈情不闲聊 on 2019-12-02 02:34:01
I've been struggling with distcp for several days and I swear I have googled enough. Here is my use case: USE CASE: I have a main folder in a certain location, say /hdfs/root, with a lot of subdirs (depth is not fixed) and files. Volume: 200,000 files ~= 30 GB. I need to copy only a subset of /hdfs/root for a client to another location, say /hdfs/dest. This subset is defined by a list of absolute paths that can be updated over time. Volume: 50,000 files ~= 5 GB. You understand that I can't use a simple hdfs dfs -cp /hdfs/root /hdfs/dest because it is not optimized, it will take every file, and it
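A hedged sketch of the direction such a copy normally takes: hadoop distcp can read its source list from a file via -f instead of walking all of /hdfs/root, and mapper memory can be raised through standard MapReduce properties to address the "running beyond physical memory limits" kill. The staging file name, memory values and destination are assumptions, not the asker's actual settings:

# Stage the client's path list on HDFS (one absolute source URI per line).
hdfs dfs -put client_paths.txt /tmp/client_paths.txt

# -f copies only the listed paths; the -D properties give each mapper more
# memory so containers are not killed for exceeding physical memory limits.
hadoop distcp \
  -Dmapreduce.map.memory.mb=4096 \
  -Dmapreduce.map.java.opts=-Xmx3276m \
  -f hdfs:///tmp/client_paths.txt \
  hdfs:///hdfs/dest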