Oozie

Scheduling Spark jobs on a regular basis

筅森魡賤 submitted on 2019-12-04 18:22:43
Which is the recommended tool for scheduling Spark jobs on a daily/weekly basis? 1) Oozie 2) Luigi 3) Azkaban 4) Chronos 5) Airflow. Thanks in advance. Joe Harris: Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines. Airflow: try this first. Decent UI, Python-ish job definitions, semi-accessible for non-programmers; the dependency declaration syntax is a bit odd. Airflow has built-in support for the fact that scheduled jobs often need to be rerun and/or backfilled, so make sure you build your pipelines to support this. Azkaban: nice UI,
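For reference, if the Oozie option (1) were chosen, the usual shape is a coordinator with frequency="${coord:days(1)}" (or ${coord:days(7)} for weekly) that triggers a workflow containing a Spark action. A minimal sketch of such a workflow, assuming a YARN cluster; the application name, class, and jar path are placeholders:

    <!-- Sketch only: master, class name, and jar path are placeholders. -->
    <workflow-app name="daily-spark-wf" xmlns="uri:oozie:workflow:0.5">
        <start to="spark-job"/>
        <action name="spark-job">
            <spark xmlns="uri:oozie:spark-action:0.1">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <master>yarn-cluster</master>
                <name>DailyAggregation</name>
                <class>com.example.DailyAggregation</class>
                <jar>${nameNode}/apps/spark-daily/lib/daily-aggregation.jar</jar>
            </spark>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Spark job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
        </kill>
        <end name="end"/>
    </workflow-app>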

How to automatically rerun a failed action in Oozie?

假如想象 submitted on 2019-12-04 14:57:36
How can I automatically re-run an action that failed in the workflow? I know how to rerun it manually from the command line or through Hue: $ oozie job -rerun ... Is there any parameter we can set in the workflow to retry automatically when an action fails? Most of the time, when an action fails in an Oozie workflow, you need to debug the error, fix it, and rerun the workflow. But sometimes you want Oozie to retry the action at an interval, for a fixed number of times, before failing the workflow. You can specify retry-max and retry-interval in the action definition.
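A minimal sketch of what that looks like in the workflow (the action name, script, and transitions are placeholders; retry-interval is in minutes, and which error codes actually trigger a user retry is configurable on the Oozie server):

    <!-- Retry the action up to 3 times, waiting 10 minutes between attempts,
         before following the error transition. Names and paths are placeholders. -->
    <action name="load-data" retry-max="3" retry-interval="10">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>load_data.sh</exec>
            <file>${appPath}/load_data.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>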

Can there be two Oozie workflow.xml files in one directory?

一曲冷凌霜 submitted on 2019-12-04 14:02:04
Can there be two Oozie workflow.xml files in one directory? If so, how can I tell the Oozie runner which one to run? You can have two workflow files (just give them unique names), then select which one to call by setting the oozie.wf.application.path value in your job properties file:
oozie.wf.application.path=hdfs://namenode:9000/path/to/job/wf-1.xml
#oozie.wf.application.path=hdfs://namenode:9000/path/to/job/wf-2.xml
Alternatively, use two different directories. But if you need to call the second workflow file as a sub-workflow, just give it a different name. Here is how I call a sub-workflow: I have 2 files
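The sub-workflow route looks roughly like the sketch below; the names and paths are placeholders, and the app-path can point directly at the second XML file rather than at a directory:

    <!-- Sketch: call wf-2.xml in the same HDFS job directory as a sub-workflow. -->
    <action name="call-second-wf">
        <sub-workflow>
            <app-path>${nameNode}/path/to/job/wf-2.xml</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>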

Tables created by an Oozie Hive action cannot be found from the Hive client, but their files are visible in HDFS

笑着哭i submitted on 2019-12-04 11:56:31
I'm trying to run a Hive script via an Oozie Hive action. I create a Hive table 'test' in my script.q, and the Oozie job runs successfully; I can find the table directory created by the Oozie job under the HDFS path /user/hive/warehouse. But I cannot find the 'test' table via "show tables" in the Hive client. I think there is something wrong with my metastore config, but I just can't figure it out. Can somebody help? oozie admin -oozie http://localhost:11000/oozie -status System mode: NORMAL oozie job -oozie http://localhost:11000/oozie -config C:\Hadoop\oozie-3.2.0-incubating\oozie-win-distro\examples
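A common cause of this symptom is that the Hive action launches without the client's hive-site.xml, so it creates the table in an embedded (Derby) metastore instead of the shared one. A hedged sketch of pointing the action at the same hive-site.xml the Hive client uses, assuming the file has been uploaded next to the workflow (paths are placeholders):

    <!-- Sketch: reference the client's hive-site.xml so both sides use the same metastore. -->
    <action name="run-hive-script">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>${appPath}/hive-site.xml</job-xml>
            <script>script.q</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>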

Handling loops in an Oozie workflow

二次信任 submitted on 2019-12-04 11:18:11
I have an Oozie use case: check input data availability and trigger a MapReduce job based on the availability of the data. So I wrote a shell script for checking the input data and created an SSH action for it in Oozie. The number of retries and the retry interval of the input data check should be configurable; after each retry, if the data is still missing, I need to send an alert, and after the specified number of retries the MapReduce job can start with the available data. I wrote a workflow as follows: <start to="datacheck" /> <action name="datacheck"> <ssh xmlns="uri:oozie:ssh-action:0.1"> <host>$
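Oozie workflows are DAGs, so a literal loop is awkward to express; two common alternatives are user retries on the action itself (retry-max / retry-interval, as in the sketch further up) or letting a coordinator wait for the input data. A hedged coordinator sketch, with all names, paths, and times as placeholders:

    <coordinator-app name="data-driven" frequency="${coord:days(1)}"
                     start="2019-12-01T00:00Z" end="2020-12-01T00:00Z" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.4">
        <datasets>
            <dataset name="input" frequency="${coord:days(1)}"
                     initial-instance="2019-12-01T00:00Z" timezone="UTC">
                <uri-template>${nameNode}/data/input/${YEAR}${MONTH}${DAY}</uri-template>
            </dataset>
        </datasets>
        <input-events>
            <data-in name="inputReady" dataset="input">
                <instance>${coord:current(0)}</instance>
            </data-in>
        </input-events>
        <action>
            <workflow>
                <app-path>${nameNode}/apps/mr-job/</app-path>
            </workflow>
        </action>
    </coordinator-app>

A <controls><timeout> element can additionally bound how long the coordinator waits for the data before the materialized action times out.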

How to use logical operators in an Oozie workflow

大兔子大兔子 submitted on 2019-12-04 09:22:40
I have an Oozie workflow in which I'm using a decision control node. In the predicate I want to combine two different conditions with "&&" to get the final TRUE/FALSE result, but I can't find the predicate syntax for such conditions. I'm using this: <decision name="comboDecision"> <switch> <case to="alpha"> --------- </case> </switch> </decision> and I want to do this: <decision name="comboDecision"> <switch> <case to="alpha"> condition1 && condition2 </case> </switch> </decision> Can anyone help me with the syntax? I will explain this with an example. Let's assume that we have a Java action (we
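Decision predicates are EL expressions, so the two conditions can be joined with the EL "and" keyword (an XML-escaped &amp;&amp; also works). A hedged sketch; the path, action name, and output key are placeholders:

    <decision name="comboDecision">
        <switch>
            <case to="alpha">
                ${fs:exists('/data/input/_SUCCESS') and wf:actionData('checker')['status'] eq 'READY'}
            </case>
            <default to="beta"/>
        </switch>
    </decision>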

Download a file weekly from FTP to HDFS

被刻印的时光 ゝ submitted on 2019-12-04 07:46:06
I want to automate the weekly download of a file from an FTP server into a CDH5 Hadoop cluster. What would be the best way to do this? I was thinking about an Oozie coordinator job, but I can't think of a good method to download the file. Since you're using CDH5, it's worth noting that the NFSv3 interface to HDFS is included in that Hadoop distribution. You should check for "Configuring an NFSv3 Gateway" in the CDH5 Installation Guide documentation. Once that's done, you could use wget, curl, Python, etc. to put the file onto the NFS mount. You probably want to do this through Oozie ... go
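Putting the pieces together, one hedged sketch is a shell action that runs a small wrapper script (shown here as the placeholder fetch_and_put.sh, which would curl the file from the FTP server and either copy it onto the NFS mount or "hdfs dfs -put" it), driven by a coordinator with frequency="${coord:days(7)}":

    <!-- Sketch: script name, argument, and paths are placeholders. -->
    <action name="ftp-to-hdfs">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>fetch_and_put.sh</exec>
            <argument>${targetDir}</argument>
            <file>${appPath}/fetch_and_put.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>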

Building Oozie: Unknown host repository.codehaus.org

佐手、 submitted on 2019-12-04 07:29:26
I'm trying to build Oozie 4.2.0 downloaded from here: http://ftp.cixug.es/apache/oozie/4.2.0/oozie-4.2.0.tar.gz After launching the build with bin/mkdistro.sh -DskipTests I'm getting this error: [ERROR] Failed to execute goal on project oozie-core: Could not resolve dependencies for project org.apache.oozie:oozie-core:jar:4.2.0: Could not transfer artifact org.apache.hbase:hbase:jar:1.1.1 from/to Codehaus repository (http://repository.codehaus.org/): Unknown host repository.codehaus.org From what I'm seeing on the Internet, the Codehaus repository is not available any more. Is there a way to build
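One common workaround, since the Codehaus repository no longer exists, is to redirect requests for it to Maven Central via a mirror in ~/.m2/settings.xml (or to edit the repository URL in Oozie's pom.xml). A hedged sketch; the mirrorOf value must match the repository id declared in the Oozie pom and is shown here only as a placeholder:

    <settings>
        <mirrors>
            <mirror>
                <id>codehaus-to-central</id>
                <!-- Placeholder: use the repository id from Oozie's pom.xml here. -->
                <mirrorOf>Codehaus repository</mirrorOf>
                <name>Redirect the dead Codehaus repository to Maven Central</name>
                <url>https://repo.maven.apache.org/maven2</url>
            </mirror>
        </mirrors>
    </settings>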

Does Oozie suppress logging from a shell job action?

荒凉一梦 submitted on 2019-12-04 05:33:59
I have a simple workflow (see below) which runs a shell script. The shell script runs a pyspark script, which moves a file from a local folder to an HDFS folder. When I run the shell script by itself, it works perfectly, and logs are redirected to a file by > spark.txt 2>&1 right in the shell script. But when I submit an Oozie job with the following workflow, the output from the shell seems to be suppressed. I tried to redirect all possible Oozie logs (-verbose -log) > oozie.txt 2>&1, but it didn't help. The workflow is finished
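One likely explanation is that the shell action runs inside a YARN container on an arbitrary cluster node, so "> spark.txt 2>&1" writes to that container's local working directory rather than to the machine the job was submitted from; the output is then normally only visible in the launcher's YARN logs. A hedged workaround sketch (the wrapper script name is a placeholder) is to have the script copy its log into HDFS when it finishes, e.g. with hdfs dfs -put -f spark.txt /logs/spark/:

    <!-- Sketch: the wrapper script runs pyspark, redirects output locally,
         then uploads the log file to HDFS before exiting. -->
    <action name="run-pyspark">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>run_and_upload_log.sh</exec>
            <file>${appPath}/run_and_upload_log.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>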

#Data technology selection# Ad-hoc queries with Shib + Presto, cluster job scheduling with HUE + Oozie

非 Y 不嫁゛ submitted on 2019-12-04 05:09:37
郑昀, created 2014/10/30, last updated 2014/10/31
1) Selection: Shib + Presto. Application scenario: ad-hoc query.
1.1. Goals of ad-hoc querying
The users are data analysts from product, operations, and sales operations; the analysts are expected to know how to write SQL query scripts and to know which business's data lives in which data mart; whether their computation task is submitted to a database or to Hadoop, it may take a long time, so waiting online is not an option. Therefore, when a user submits a computation task (Pig / SQL / Hive SQL), the console reports that the task has been queued and gives a rough estimate of the computation time and other friendly hints; these jobs have a low priority; users and administrators can view the queued tasks, including the execution time, running duration, and results of completed tasks; once a task has results, the console shows a notification or sends an email, and the user can view and download the data online.
1.2. Current technology selection for ad-hoc queries
Graphical interface: Shib; data query engine: Facebook Presto.
1.3. Why replace the data query engine?
MapReduce-based Hadoop is suited to batch processing, but not to ad-hoc query scenarios. MySQL on the InnoDB/MyISAM storage engines is naturally unsuitable as well. We also looked at columnar storage engines such as InfiniDB/InfoBright (still MySQL-based); they are better suited to historical archive data that rarely changes, so they are not a good fit for an e-commerce scenario.