Oozie

Oozie job won't run if using PySpark in SparkAction

岁酱吖の · Submitted 2019-12-06 13:38:53
Question: I've encountered several examples of SparkAction jobs in Oozie, most of them in Java. I edited one slightly and ran the example on Cloudera CDH QuickStart 5.4.0 (with Spark version 1.4.0). workflow.xml:

<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
    <start to='spark-node' />
    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/
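For reference, a complete spark-action node generally follows the shape sketched below. This is a generic illustration, not the asker's actual file: the class name, jar path, and spark-opts are placeholders. Note also that with PySpark the <jar> element points at a .py file, and on early spark-action versions the pyspark and py4j libraries must be reachable by the launcher, which is a frequent cause of PySpark jobs failing where the Java examples work.

```xml
<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <name>SparkFileCopy</name>
        <class>org.example.FileCopy</class>                  <!-- placeholder class -->
        <jar>${nameNode}/user/${wf:user()}/lib/app.jar</jar> <!-- placeholder jar; a .py file for PySpark -->
        <spark-opts>--executor-memory 1G</spark-opts>        <!-- illustrative option -->
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```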

Python subprocess with oozie

让人想犯罪 __ · Submitted 2019-12-06 13:37:32
I'm trying to use subprocess in a Python script that I call from an Oozie shell action. The subprocess is supposed to read a file stored in Hadoop's HDFS. I'm using hadoop-1.2.1 in pseudo-distributed mode and oozie-3.3.2. Here is the Python script, named connected_subprocess.py:

#!/usr/bin/python
import subprocess
import networkx as nx

liste = subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt", shell=True).split('\n')
G = nx.DiGraph()
f = open("/home/rlk/liste_strongly_connected.txt", "wb")
for item in liste:
    try:
        app1, app2 = item.split('\t')
        G.add_edge(app1
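One common pitfall in this setup (a hedged sketch, not necessarily the asker's fix): a shell action's environment may not have the same PATH as an interactive shell, so passing the command as an argument list (no shell=True) and decoding the output explicitly makes failures easier to diagnose. Here printf merely stands in for hadoop fs -cat so the snippet is self-contained.

```python
import subprocess

def read_lines(cmd):
    """Run a command (argument list) and return its stdout as a list of lines."""
    out = subprocess.check_output(cmd)  # list form avoids shell quoting issues
    return out.decode("utf-8").rstrip("\n").split("\n")

# Stand-in for ["hadoop", "fs", "-cat", "/user/root/output-data/calcul-proba/final.txt"]:
lines = read_lines(["printf", "%s", "app1\tapp2\napp3\tapp4\n"])
for item in lines:
    left, right = item.split("\t")  # same tab-split as in the question's script
```

In the real action, using the absolute path to the hadoop binary (e.g. via `which hadoop` on the node) is often what makes the subprocess call succeed under Oozie.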

Suggestion for scheduling tool(s) for building hadoop based data pipelines

允我心安 · Submitted 2019-12-06 12:44:05
Between Apache Oozie, Spotify/Luigi, and airbnb/airflow, what are the pros and cons of each? I have used Oozie and Airflow in the past to build a data ingestion pipeline using Pig and Hive. Currently, I am building a pipeline that looks at logs, extracts useful events, and puts them on Redshift. I found that Airflow was much easier to use, test, and set up. It has a much nicer UI and lets users perform actions from the UI itself, which is not the case with Oozie. Any information about Luigi or other insights regarding stability and issues are welcome. Azkaban:

Sqoop Free-Form Query Causing Unrecognized Arguments in Hue/Oozie

蓝咒 · Submitted 2019-12-06 11:21:21
I am attempting to run a Sqoop command with a free-form query, because I need to perform an aggregation. It is being submitted via the Hue interface, as an Oozie workflow. The following is a scaled-down version of the command and query. When the command is processed, the "--query" statement (enclosed in quotes) results in each portion of the query being interpreted as an unrecognized argument, as shown in the error following the command. In addition, the target directory is being misinterpreted. What is preventing this from running, and what can be done to resolve it? The ${env} and ${shard}
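A likely culprit (a hedged sketch, not confirmed from the truncated question): Oozie's sqoop action splits the <command> element on whitespace, so a quoted --query cannot survive intact. The usual workaround is to pass each token as a separate <arg> element, with the whole query in one <arg>. The connect string, query, and paths below are placeholders.

```xml
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <arg>import</arg>
    <arg>--connect</arg>
    <arg>jdbc:mysql://example-host/db</arg>  <!-- placeholder connect string -->
    <arg>--query</arg>
    <arg>SELECT col, COUNT(*) FROM t WHERE $CONDITIONS GROUP BY col</arg>
    <arg>--target-dir</arg>
    <arg>/user/example/out</arg>             <!-- placeholder path -->
    <arg>-m</arg>
    <arg>1</arg>
</sqoop>
```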

How to auto rerun of failed action in oozie?

北战南征 · Submitted 2019-12-06 08:58:41
Question: How can I automatically re-run any action that failed in the workflow? I know how to rerun manually from the command line or through Hue: $ oozie job -rerun ... Is there any parameter we can set or provide in the workflow to retry automatically when an action fails? Answer 1: Most of the time, when an action fails in an Oozie workflow, you need to debug and fix the error and rerun the workflow. But there are times when you want Oozie to retry the action after an interval, for a fixed number of times
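The retry mechanism the answer alludes to can be expressed directly on an action node via the retry-max and retry-interval attributes (the interval is in minutes). A minimal sketch, with an illustrative action name and values:

```xml
<action name="my-action" retry-max="3" retry-interval="5">
    <!-- action body as usual; Oozie retries up to 3 times, 5 minutes apart -->
</action>
```

Note that by default these retries apply only to certain transient error codes; server-side configuration can broaden which errors are retried.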

Tables created by oozie hive action cannot be found from hive client but can find them in HDFS

↘锁芯ラ · Submitted 2019-12-06 08:02:47
Question: I'm trying to run a Hive script via an Oozie Hive action. I just created a Hive table 'test' in my script.q, and the Oozie job ran successfully; I can find the table created by the Oozie job under the HDFS path /user/hive/warehouse. But I could not find the 'test' table via the command "show tables" in the Hive client. I think there is something wrong with my metastore config, but I just can't figure it out. Can somebody help? oozie admin -oozie http://localhost:11000/oozie -status System mode: NORMAL oozie job
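The symptom described (data files present in the warehouse directory, but the table invisible to the Hive client) typically means the Oozie Hive action fell back to its own embedded Derby metastore instead of the shared one. A common remedy, sketched here with placeholder file names, is to ship the same hive-site.xml the Hive client uses and reference it from the action via <job-xml>:

```xml
<hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <job-xml>hive-site.xml</job-xml>  <!-- must point at the same metastore the Hive client uses -->
    <script>script.q</script>
</hive>
```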

Handling loops in oozie workflow

≡放荡痞女 · Submitted 2019-12-06 05:48:26
Question: I have an Oozie use case: check input data availability and trigger a MapReduce job based on the availability of the data. So I wrote a shell script to check the input data and created an SSH action for it in Oozie. The number of retries and the retry intervals of the input-data check should be configurable, and after each retry, if the data is still missing, I have to send an alert; after the specified number of retries, the MapReduce job can start with whatever data is available. I wrote a workflow as follows:
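An Oozie workflow is a DAG, so it has no native loop construct; a common pattern for this use case is a decision node whose branches either proceed, recurse via a sub-workflow, or fall through. The node names and EL variables below are hypothetical placeholders for illustration:

```xml
<decision name="check-data">
    <switch>
        <case to="mapreduce-node">${dataAvailable eq "true"}</case>
        <case to="retry-subworkflow">${retryCount lt maxRetries}</case>
        <default to="alert-then-mapreduce"/>
    </switch>
</decision>
```

The "retry-subworkflow" branch would be a sub-workflow action that sleeps, re-runs the check, and re-enters the same decision with an incremented counter.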

E0701 XML schema error in OOZIE workflow

跟風遠走 · Submitted 2019-12-06 05:00:56
The following is my workflow.xml:

<workflow-app xmlns="uri:oozie:workflow:0.3" name="import-job">
    <start to="createtimelinetable" />
    <action name="createtimelinetable">
        <sqoop xmlns="uri:oozie:sqoop-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <command>import --connect jdbc:mysql://10.65.220.75:3306/automation --table ABC --username root</command>
        </sqoop>
        <ok to="end"/>
        <error to="end"/>
    </action>
    <end name="end"/>
</workflow-app>
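E0701 is a schema-validation error, and one cause worth checking is an action schema version the installed Oozie server does not support (e.g. uri:oozie:sqoop-action:0.3 on an older server; dropping to 0.2 sometimes resolves it). Before submitting, the workflow can be validated locally with the Oozie CLI; the path below is a placeholder:

```shell
oozie validate workflow.xml
```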

How to force coordinator action materialization at specific frequency?

百般思念 · Submitted 2019-12-06 03:52:32
I would like to know whether it is possible, and how, to force a coordinator to materialize (instantiate) a workflow at regular intervals even if previously instantiated workflows are not done yet. Let me explain: I have a simple coordinator that looks like this:

<coordinator-app name="myApp" frequency="${coord:hours(3)}"
                 start="2015-01-01T0:00Z" end="2016-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${myPath}/workflow.xml</app-path>
        </workflow>
    </action>
</coordinator-app>

The frequency is set to 3 hours. Every 3 hours, I expect the coordinator to "materialize"
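Coordinator execution behavior is governed by the optional <controls> block; in particular, <concurrency> defaults to 1, which makes each newly materialized action wait for the previous one. A hedged sketch of the same coordinator with overlap allowed (the values here are illustrative, not recommendations):

```xml
<coordinator-app name="myApp" frequency="${coord:hours(3)}"
                 start="2015-01-01T0:00Z" end="2016-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <controls>
        <concurrency>3</concurrency>   <!-- allow up to 3 actions to run at once -->
        <execution>FIFO</execution>    <!-- run materialized actions oldest-first -->
        <throttle>12</throttle>        <!-- cap on materialized-but-waiting actions -->
    </controls>
    <action>
        <workflow>
            <app-path>${myPath}/workflow.xml</app-path>
        </workflow>
    </action>
</coordinator-app>
```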

Add Spark to Oozie shared lib

拜拜、爱过 · Submitted 2019-12-06 03:43:27
By default, Oozie's shared lib directory provides libraries for Hive, Pig, and MapReduce. If I want to run a Spark job on Oozie, it might be better to add the Spark lib jars to Oozie's shared lib instead of copying them into the app's lib directory. How can I add the Spark lib jars (including spark-core and its dependencies) to Oozie's shared lib? Any comment / answer is appreciated. The Spark action is scheduled to be released with Oozie 4.2.0, even though the doc seems to be a bit behind. See the related JIRA here: Oozie JIRA - Add spark action executor. Cloudera's release CDH 5.4 has it already though, see official
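The manual route generally looks like the sketch below. All paths are illustrative: the sharelib root depends on the installation (on Oozie 4.x with the sharelib service it is usually a timestamped lib_<ts> directory under /user/oozie/share/lib).

```shell
# Create a "spark" subdirectory in the sharelib and copy the Spark jars in:
hdfs dfs -mkdir -p /user/oozie/share/lib/spark
hdfs dfs -put $SPARK_HOME/lib/*.jar /user/oozie/share/lib/spark/

# Ask the running Oozie server to rescan the sharelib (Oozie 4.x):
oozie admin -oozie http://localhost:11000/oozie -sharelibupdate
```

Jobs then opt in through job.properties, e.g. `oozie.action.sharelib.for.spark=spark`, so the spark action picks up that directory at launch.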