Question
I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts and Spark jobs.
For the Spark jobs in particular, I want to use Apache Livy, but I am not sure whether that is a good idea or whether I should run spark-submit directly.
What is the best way to track a Spark job with Airflow once I have submitted it?
Answer 1:
My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities (a submission sketch follows the list below):
- Specifying the remote `master` IP: requires modifying global configurations / environment variables
- Using `SSHOperator`: the `SSH` connection might break
- Using `EmrAddStepsOperator`: dependent on `EMR`
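Here is a minimal sketch of the Livy route from an Airflow task, assuming a Livy server reachable from the Airflow workers and using Livy's standard `POST /batches` endpoint. The host/port, JAR location and class name are placeholders, not from the original post:

```python
# Minimal sketch: submit an application JAR to Livy's /batches endpoint from Airflow.
# LIVY_URL, the JAR path and the class name are assumptions / placeholders.
import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

LIVY_URL = "http://livy-host:8998"  # assumed Livy endpoint


def submit_spark_batch(**context):
    """POST a batch job to Livy and push the batch id via XCom for downstream polling."""
    payload = {
        "file": "s3://my-bucket/jars/my-spark-app.jar",  # hypothetical JAR location
        "className": "com.example.MySparkApp",           # hypothetical main class
        "args": ["--run-date", context["ds"]],
    }
    resp = requests.post(
        f"{LIVY_URL}/batches",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    batch_id = resp.json()["id"]
    context["ti"].xcom_push(key="livy_batch_id", value=batch_id)
    return batch_id


with DAG("spark_via_livy", start_date=datetime(2019, 1, 1), schedule_interval=None) as dag:
    submit = PythonOperator(task_id="submit_spark_batch", python_callable=submit_spark_batch)
```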
Regarding tracking
- `Livy` only reports `state`, not progress (% completion of stages)
- If you're OK with that, you can just poll the `Livy` server via its `REST` API and keep printing the logs to the console; those will appear in the task logs in the WebUI (View Logs), as in the polling sketch below
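A sketch of that polling loop, assuming Livy's standard `GET /batches/{id}/state` and `GET /batches/{id}/log` endpoints; the host, polling interval and failure handling are my additions, not from the answer:

```python
# Sketch: poll Livy for batch state and stream driver logs into the Airflow task log.
# LIVY_URL and poll_interval are assumptions; states follow Livy's documented values.
import time

import requests

LIVY_URL = "http://livy-host:8998"  # assumed Livy endpoint
TERMINAL_STATES = {"success", "dead", "killed"}


def track_spark_batch(batch_id: int, poll_interval: int = 30) -> str:
    """Poll Livy until the batch reaches a terminal state, printing new log lines as we go."""
    log_offset = 0
    while True:
        state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]

        # Fetch any new log lines; printed lines show up under View Logs in the WebUI
        logs = requests.get(
            f"{LIVY_URL}/batches/{batch_id}/log",
            params={"from": log_offset, "size": 100},
        ).json()
        for line in logs.get("log", []):
            print(line)
        log_offset = logs.get("from", log_offset) + len(logs.get("log", []))

        if state in TERMINAL_STATES:
            if state != "success":
                raise RuntimeError(f"Livy batch {batch_id} finished in state {state}")
            return state
        time.sleep(poll_interval)
```

Calling this from the task that submitted the batch (or from a follow-up task that pulls the id from XCom) is enough to surface the Spark job's lifecycle in Airflow, even though only the coarse state is available.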
Other considerations
- `Livy` doesn't support reusing a `SparkSession` across `POST /batches` requests
- If that's imperative, you'll have to write your application code in `PySpark` and use `POST /sessions` requests, along the lines of the sketch below
References
- How to submit Spark jobs to EMR cluster from Airflow?
- livy/examples/pi_app
- rssanders3/livy_spark_operator_python_example
Useful links
- How to submit Spark jobs to EMR cluster from Airflow?
- Remote spark-submit to YARN running on EMR
Source: https://stackoverflow.com/questions/54228651/spark-job-submission-using-airflow-by-submitting-batch-post-method-on-livy-and-t