Question
I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop. I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice about it? It looks a little complicated to deploy Spark remotely from another cluster, and it would create some duplication of configuration files.
Answer 1:
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)

Once an Application Master is deployed within YARN, Spark is effectively running local to the Hadoop cluster.

If you really want, you could also ship an hdfs-site.xml and hive-site.xml from Airflow (if that's possible), but otherwise at least the hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers may have a Hive client installed on them).
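Below is a minimal sketch of what this answer describes, assuming Airflow 2.x-style imports: Airflow runs spark-submit in client mode against the remote YARN cluster, with HADOOP_CONF_DIR pointing at a directory that holds the copied yarn-site.xml. The DAG id, paths, and job file are hypothetical placeholders, not anything from the original answer.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_submit_yarn_client",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # yarn-site.xml (and optionally hdfs-site.xml / hive-site.xml) copied from the
    # Hadoop cluster into /opt/hadoop-conf on the Airflow host (assumed path).
    submit_job = BashOperator(
        task_id="spark_submit",
        bash_command=(
            "export HADOOP_CONF_DIR=/opt/hadoop-conf && "
            "spark-submit --master yarn --deploy-mode client "
            "--name airflow_example_job /opt/jobs/example_job.py"
        ),
    )
```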
Answer 2:
I prefer submitting Spark jobs using SSHOperator and running the spark-submit command, which saves you from copying over yarn-site.xml. Also, I would not create a cluster for Airflow if the only task I perform is running Spark jobs; a single VM with LocalExecutor should be fine.
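A minimal sketch of this approach, assuming the SSH provider package is installed and an Airflow connection (hypothetically named spark_edge_node here) points at an edge node of the Spark/Hadoop cluster; spark-submit then runs on that node, so no Hadoop config files need to live on the Airflow VM.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="spark_submit_over_ssh",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    submit_job = SSHOperator(
        task_id="spark_submit_remote",
        ssh_conn_id="spark_edge_node",  # SSH connection configured in Airflow (assumed name)
        command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/opt/jobs/example_job.py"  # assumed job path on the edge node
        ),
    )
```

With LocalExecutor on a single VM, this one SSH connection is the only coupling between the Airflow machine and the cluster.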
Answer 3:
There are a variety of options for remotely performing spark-submit via Airflow:
- Emr-Step
- Apache-Livy (see this for a hint)
- SSH
Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done; a rough Livy-based sketch follows below.
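As one illustration of the do-it-yourself nature of these options, here is a rough sketch of the Apache Livy route: a Python task that POSTs a batch job to Livy's /batches REST endpoint. The Livy host/port and job file are assumptions, and in practice you would wrap this (plus polling of the batch state) in a custom operator.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

LIVY_URL = "http://livy-host:8998"  # assumed Livy endpoint on the Spark cluster


def submit_livy_batch():
    """Submit a Spark application through Livy's /batches endpoint."""
    resp = requests.post(
        f"{LIVY_URL}/batches",
        json={
            "file": "hdfs:///jobs/example_job.py",  # assumed job location
            "name": "airflow_livy_example",
        },
        timeout=30,
    )
    resp.raise_for_status()
    # The returned batch id can be polled later via GET /batches/{id}/state.
    return resp.json()["id"]


with DAG(
    dag_id="spark_submit_via_livy",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    submit_job = PythonOperator(
        task_id="livy_submit",
        python_callable=submit_livy_batch,
    )
```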
Source: https://stackoverflow.com/questions/52013087/airflow-and-spark-hadoop-unique-cluster-or-one-for-airflow-and-other-for-spark