Oozie job won't run if using PySpark in SparkAction

Question

I've encountered several examples of SparkAction jobs in Oozie, and most of them are in Java. I edited one slightly and ran it on Cloudera CDH Quickstart 5.4.0 (with Spark version 1.4.0).

workflow.xml

<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
    <start to='spark-node' />

    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/spark"/>
            </prepare>
            <master>${master}</master>
            <mode>${mode}</mode>
            <name>Spark-FileCopy</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/spark</arg>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>

    <kill name="fail">
        <message>Workflow failed, error
            message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
    </kill>
    <end name='end' />
</workflow-app>

job.properties

nameNode=hdfs://quickstart.cloudera:8020
jobTracker=quickstart.cloudera:8032
master=local[2]
mode=client
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark

The Oozie workflow example (in Java) was able to complete and do its task.

However, I've written a spark-submit job in Python / PySpark. I tried removing <class> and pointing <jar> at the script:

<jar>my_pyspark_job.py</jar>

but I get this error in the logs when I attempt to run the Oozie-Spark job:

Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [2]

What should I put in the <class> and <jar> tags if I'm using Python / PySpark?

Answer 1:

I too struggled a lot with the spark-action in Oozie. I set up the sharelib properly and tried to pass the appropriate jars using the --jars option within the <spark-opts> </spark-opts> tags, but to no avail.

I always ended up getting some error or the other. The most I could do was run all java/python spark jobs in local mode through the spark-action.

However, I got all my spark jobs running in oozie in all modes of execution using the shell action. The major problem with the shell action is that shell jobs are deployed as the 'yarn' user. If you happen to deploy your oozie spark job from a user account other than yarn, you'll end up with a Permission Denied error (because the user would not be able to access the spark assembly jar copied into /user/yarn/.SparkStaging directory). The way to solve this is to set the HADOOP_USER_NAME environment variable to the user account name through which you deploy your oozie workflow.

Below is a workflow that illustrates this configuration. I deploy my oozie workflows from the ambari-qa user.

<workflow-app xmlns="uri:oozie:workflow:0.4" name="sparkjob">
    <start to="spark-shell-node"/>
    <action name="spark-shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.launcher.mapred.job.queue.name</name>
                    <value>launcher2</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
                <property>
                    <name>oozie.hive.defaults</name>
                    <value>/user/ambari-qa/sparkActionPython/hive-site.xml</value>
                </property>
            </configuration>
            <exec>/usr/hdp/current/spark-client/bin/spark-submit</exec>
            <argument>--master</argument>
            <argument>yarn-cluster</argument>
            <argument>wordcount.py</argument>
            <env-var>HADOOP_USER_NAME=ambari-qa</env-var>
            <file>/user/ambari-qa/sparkActionPython/wordcount.py#wordcount.py</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="spark-fail"/>
    </action>
    <kill name="spark-fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
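
For readers who want to try the same invocation outside Oozie first, here is a minimal Python sketch of what the shell action above ends up running. The spark-submit path, the yarn-cluster master and the ambari-qa user are copied from the workflow; the function name build_spark_submit and its base_env parameter are made up for illustration and are not part of Oozie or Spark.

```python
import os

# Path copied from the workflow above; it is specific to an HDP install.
SPARK_SUBMIT = "/usr/hdp/current/spark-client/bin/spark-submit"


def build_spark_submit(script, user, master="yarn-cluster", base_env=None):
    """Return (argv, env) for a spark-submit call that should run as `user`.

    Hypothetical helper: it only assembles the command and environment,
    it does not launch anything.
    """
    # Start from the given environment (or the current one) and add the
    # variable that the <env-var> element sets in the workflow.
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_USER_NAME"] = user
    argv = [SPARK_SUBMIT, "--master", master, script]
    return argv, env


# Possible usage on a machine that has spark-submit installed:
#   import subprocess
#   argv, env = build_spark_submit("wordcount.py", "ambari-qa")
#   subprocess.check_call(argv, env=env)
```

If the command works this way but fails under Oozie, the difference is most likely in the user or the files available inside the container rather than in the Spark job itself.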

Hope this helps!

Answer 2:

Try configuring the Oozie Spark action to bring the needed files to the local node. You can do this with a file tag:

<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${resourceManager}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>local[2]</master>
    <mode>client</mode>
    <name>${name}</name>
    <jar>my_pyspark_job.py</jar>
    <file>{path to your file on hdfs}/my_pyspark_job.py#my_pyspark_job.py</file>
</spark>

Explanation: An Oozie action runs inside a YARN container, which YARN allocates on a node that has available resources. Before running the action (which is actually the "driver" code), it copies all needed files (jars, for example) to that node, into the folder allocated for the container's resources. So by adding the <file> tag to the Oozie action, you tell it to bring my_pyspark_job.py to the node where the action executes.
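
To make the "#" part of the file tag concrete: the text after "#" is the name the file gets in the container's working directory, and without it the file keeps its own base name. A small Python sketch of that naming rule follows; localized_name is a made-up helper for illustration, not an Oozie API, and the second path in the usage comment is a made-up example.

```python
import posixpath


def localized_name(file_spec):
    """Return the name a <file> entry would have in the working directory.

    Hypothetical helper that mirrors the 'path#alias' convention:
    the alias after '#' wins; otherwise the last path component is used.
    """
    path, sep, alias = file_spec.partition("#")
    if sep and alias:
        return alias
    return posixpath.basename(path)


# localized_name("/user/ambari-qa/sparkActionPython/wordcount.py#wordcount.py")
# gives "wordcount.py", which is the name to use in <jar> or <argument>.
```

The value of <jar> (or the script argument of a shell action) has to match this localized name, not the HDFS path.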

In my case I want to run a bash script (run-hive-partitioner.bash) that runs a Python script (hive-generic-partitioner.py), so I need all the files to be locally accessible on the node:

<action name="repair_hive_partitions">
  <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>${appPath}/run-hive-partitioner.bash</exec>
    <argument>${db}</argument>
    <argument>${tables}</argument>
    <argument>${base_working_dir}</argument>
    <file>${appPath}/run-hive-partitioner.bash#run-hive-partitioner.bash</file>
    <file>${appPath}/hive-generic-partitioner.py#hive-generic-partitioner.py</file>
    <file>${appPath}/util.py#util.py</file>
  </shell>
  <ok to="end"/>
  <error to="kill"/>
</action>

where ${appPath} is hdfs://ci-base.com:8020/app/oozie/util/wf-repair_hive_partitions

This is what I get in my job:

Files in current dir:/hadoop/yarn/local/usercache/hdfs/appcache/application_1440506439954_3906/container_1440506439954_3906_01_000002/

======================
File: hive-generic-partitioner.py
File: util.py
File: run-hive-partitioner.bash
...
File: job.xml
File: json-simple-1.1.jar
File: oozie-sharelib-oozie-4.1.0.2.2.4.2-2.jar
File: launch_container.sh
File: oozie-hadoop-utils-2.6.0.2.2.4.2-2.oozie-4.1.0.2.2.4.2-2.jar

As you can see, Oozie (or actually YARN, I think) shipped all the needed files to the container's temporary folder on the node, and the job is now able to run them.
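
A listing like the one above is easy to produce from the job itself, which helps when checking whether a <file> entry actually arrived. Here is a minimal Python sketch that could be called at the top of the script; the function name describe_working_dir is made up, and the output format only imitates the listing shown above.

```python
import os


def describe_working_dir(path="."):
    """Return report lines naming every entry found in `path`.

    Hypothetical debugging helper: call it first thing in the job and
    print the result so it shows up in the container's stdout log.
    """
    lines = [
        "Files in current dir:" + os.path.abspath(path),
        "======================",
    ]
    for name in sorted(os.listdir(path)):
        lines.append("File: " + name)
    return lines


# Possible usage inside the job:
#   print("\n".join(describe_working_dir()))
```

If the script you reference is missing from this listing, the problem is in the <file> entries rather than in the script.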

Answer 3:

I was able to "fix" this issue although it leads to another issue. Nonetheless, I will still post it.

In stderr of the Oozie container logs, it shows:

Error: Only local python files are supported

And I found a solution here

This is my initial workflow.xml:

    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>local[2]</master>
        <mode>client</mode>
        <name>${name}</name>
        <jar>my_pyspark_job.py</jar>
    </spark>

Initially, I copied the Python script I wanted to run as a spark-submit job to HDFS. It turns out that the .py script is expected on the local file system, so I referred to the script by its absolute path on the local file system instead:

<jar>/<absolute-local-path>/my_pyspark_job.py</jar>
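
The "Only local python files are supported" message comes from spark-submit rejecting a script path that points at a remote file system. As a rough approximation only, the check can be sketched in Python as below. The set of schemes treated as local is my assumption, the name looks_local is made up, and the real logic lives inside spark-submit and may differ between Spark versions.

```python
from urllib.parse import urlparse

# Assumption: a path with no scheme, or with a file:/local: scheme,
# counts as local. Anything else (hdfs:, s3:, ...) counts as remote.
LOCAL_SCHEMES = {"", "file", "local"}


def looks_local(path):
    """Return True if `path` appears to refer to the local file system."""
    return urlparse(path).scheme in LOCAL_SCHEMES


# looks_local("/home/cloudera/my_pyspark_job.py") is True (made-up path),
# while an hdfs:// URI such as
# "hdfs://quickstart.cloudera:8020/user/cloudera/my_pyspark_job.py" is not.
```

Note that a bare file name may be resolved against HDFS before it reaches spark-submit, which would explain why the original relative <jar> value was rejected.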

Answer 4:

We were getting the same error. If you copy the spark-assembly jar from '/path/to/spark-install/lib/spark-assembly*.jar' (the location depends on the distribution) into the lib directory under your oozie.wf.application.path, alongside your application jar, it should work.

Source: https://stackoverflow.com/questions/31450828/oozie-job-wont-run-if-using-pyspark-in-sparkaction
