Airflow and Spark/Hadoop - Single cluster, or one for Airflow and another for Spark/Hadoop?


Question


I'm trying to figure out the best way to work with Airflow and Spark/Hadoop. I already have a Spark/Hadoop cluster, and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to the Spark/Hadoop cluster.

Any advice? It looks a little complicated to run spark-submit remotely from another cluster, and it would duplicate some configuration files.


Answer 1:


You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)

Once an Application Master is deployed within YARN, Spark is running locally to the Hadoop cluster.

If you really want, you could add an hdfs-site.xml and hive-site.xml to be submitted from Airflow as well (if that's possible), but otherwise at least the hdfs-site.xml should be picked up from the YARN container classpath (not every NodeManager may have a Hive client installed on it).
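To make this concrete, here is a minimal sketch of an Airflow task that shells out to spark-submit locally on the Airflow machine in YARN client mode. It assumes the client-side Hadoop configuration (yarn-site.xml and friends) has been copied to the Airflow host and that a spark_default connection pointing at YARN exists; the import path, application path, and connection setup are assumptions, not something from the original answer.

```python
from datetime import datetime

from airflow import DAG
# Import path varies by Airflow version; this is the older contrib location.
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

with DAG(dag_id="spark_client_mode_example",
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:

    # spark-submit runs on the Airflow host itself; in client deploy mode the
    # driver stays local while executors run on the remote YARN cluster.
    # Assumes HADOOP_CONF_DIR / YARN_CONF_DIR on this host points at the
    # directory holding the copied yarn-site.xml.
    submit_job = SparkSubmitOperator(
        task_id="submit_job",
        application="/opt/jobs/my_job.py",  # placeholder application path
        conn_id="spark_default",            # assumed connection with master set to yarn
        name="airflow_spark_job",
    )
```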




Answer 2:


I prefer submitting Spark jobs with the SSHOperator, running the spark-submit command on the cluster itself, which saves you from copying yarn-site.xml around. Also, I would not create a cluster for Airflow if the only task you perform is running Spark jobs; a single VM with the LocalExecutor should be fine.
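For illustration, a minimal sketch of this SSHOperator approach, assuming an SSH connection (here called spark_edge_node) to an edge node of the Hadoop cluster is already defined in Airflow; the connection id and application path are placeholders.

```python
from datetime import datetime

from airflow import DAG
# Import path varies by Airflow version; this is the older contrib location.
from airflow.contrib.operators.ssh_operator import SSHOperator

with DAG(dag_id="spark_via_ssh_example",
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:

    # spark-submit runs on the cluster's edge node over SSH, so the Airflow
    # machine needs no Hadoop/Spark client configuration of its own.
    submit_over_ssh = SSHOperator(
        task_id="submit_over_ssh",
        ssh_conn_id="spark_edge_node",  # placeholder connection id
        command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/opt/jobs/my_job.py"       # placeholder application path
        ),
    )
```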




Answer 3:


There are a variety of options for remotely performing spark-submit via Airflow.

  • EMR Step
  • Apache Livy (see this for a hint)
  • SSH

Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.
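As an illustration of the Livy route, here is a minimal sketch that submits a batch job through Livy's REST API; the Livy host/port and application path are placeholders, and in practice you would wrap this (plus state polling and error handling) in a custom Airflow operator, as the answer notes.

```python
import requests

LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint


def submit_batch():
    """Submit a Spark application as a Livy batch and return its batch id."""
    payload = {
        "file": "hdfs:///jobs/my_job.py",  # placeholder application on HDFS
        "args": ["--date", "2019-12-11"],
    }
    resp = requests.post(f"{LIVY_URL}/batches", json=payload)
    resp.raise_for_status()
    return resp.json()["id"]


def batch_state(batch_id):
    """Poll the state of a previously submitted batch (e.g. running/success/dead)."""
    resp = requests.get(f"{LIVY_URL}/batches/{batch_id}")
    resp.raise_for_status()
    return resp.json()["state"]
```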



Source: https://stackoverflow.com/questions/52013087/airflow-and-spark-hadoop-unique-cluster-or-one-for-airflow-and-other-for-spark
