How to submit Spark jobs to EMR cluster from Airflow?


Question


How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow? I have Airflow set up on an AWS EC2 server in the same SG, VPC and subnet.

I need a solution so that Airflow can talk to EMR and execute spark-submit.

https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/

This blog covers execution after the connection has already been established (it didn't help much).

In Airflow I have created a connection using the UI for AWS and EMR:

Below is the code which lists the EMR clusters that are active and terminated; I can also fine-tune it to get only active clusters:

    from airflow.contrib.hooks.aws_hook import AwsHook

    # Reuses the 'aws_default' connection configured in the Airflow UI
    hook = AwsHook(aws_conn_id='aws_default')
    client = hook.get_client_type('emr', 'eu-central-1')

    # Lists clusters in all states; pass ClusterStates=['RUNNING', 'WAITING']
    # to list_clusters() to get only the active ones
    for cluster in client.list_clusters()['Clusters']:
        print(cluster['Status']['State'], cluster['Name'])

My question is: how can I update the above code to perform spark-submit actions?


Answer 1:


While it may not directly address your particular query, broadly, here are some ways you can trigger spark-submit on a (remote) EMR cluster via Airflow:

  1. Use Apache Livy

    • This solution is actually independent of the remote server, i.e., EMR
    • Here's an example
    • The downside is that Livy is in its early stages, and its API appears incomplete and wonky to me
    • A minimal Livy sketch follows this list
  2. Use the EmrSteps API

    • Dependent on the remote system: EMR
    • Robust, but since it is inherently async, you will also need an EmrStepSensor (alongside EmrAddStepsOperator)
    • On a single EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist)
    • A minimal EmrAddStepsOperator sketch follows this list
  3. Use SSHHook / SSHOperator

    • Again independent of the remote system
    • Comparatively easier to get started with
    • If your spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome
    • A minimal SSHOperator sketch follows this list
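
For (1), here's a minimal sketch of submitting a Spark batch through Livy's REST API. The master hostname, JAR path and class name are placeholder assumptions; Livy listens on port 8998 by default.

    import json
    import requests

    # Hypothetical EMR master host; Livy's default port is 8998
    LIVY_URL = 'http://<emr-master-dns>:8998/batches'

    payload = {
        'file': 's3://my-bucket/jars/my-spark-job.jar',  # hypothetical JAR
        'className': 'com.example.MySparkJob',           # hypothetical class
        'args': ['--date', '2019-01-01'],
    }
    resp = requests.post(LIVY_URL, data=json.dumps(payload),
                         headers={'Content-Type': 'application/json'})

    # The response carries the batch id; poll /batches/<id>/state until the
    # batch reaches 'success' or 'dead'
    print(resp.json())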
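
For (2), a minimal sketch wiring EmrAddStepsOperator and EmrStepSensor together. The cluster id, class name and JAR path are placeholder assumptions.

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
    from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

    dag = DAG('emr_spark_submit', start_date=datetime(2019, 1, 1),
              schedule_interval=None)

    # spark-submit wrapped in an EMR step via command-runner.jar
    SPARK_STEPS = [{
        'Name': 'my_spark_step',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--deploy-mode', 'cluster',
                     '--class', 'com.example.MySparkJob',      # hypothetical
                     's3://my-bucket/jars/my-spark-job.jar'],  # hypothetical
        },
    }]

    add_step = EmrAddStepsOperator(
        task_id='add_step',
        job_flow_id='j-XXXXXXXXXXXXX',  # hypothetical cluster id
        aws_conn_id='aws_default',
        steps=SPARK_STEPS,
        dag=dag,
    )

    # Adding a step is async, so a sensor waits for it to complete
    watch_step = EmrStepSensor(
        task_id='watch_step',
        job_flow_id='j-XXXXXXXXXXXXX',
        step_id="{{ task_instance.xcom_pull(task_ids='add_step', "
                "key='return_value')[0] }}",
        aws_conn_id='aws_default',
        dag=dag,
    )

    add_step >> watch_step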
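
For (3), a minimal SSHOperator sketch. The ssh_emr_master connection id, class name and JAR path are placeholder assumptions; the command string is exactly where the argument-building gets cumbersome.

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.ssh_operator import SSHOperator

    dag = DAG('emr_ssh_spark_submit', start_date=datetime(2019, 1, 1),
              schedule_interval=None)

    # Built by plain string concatenation; this grows unwieldy as the
    # number of spark-submit arguments increases
    SPARK_SUBMIT_CMD = (
        'spark-submit --deploy-mode cluster '
        '--class com.example.MySparkJob '       # hypothetical class
        's3://my-bucket/jars/my-spark-job.jar'  # hypothetical JAR
    )

    submit_job = SSHOperator(
        task_id='spark_submit_over_ssh',
        ssh_conn_id='ssh_emr_master',  # hypothetical SSH connection to master
        command=SPARK_SUBMIT_CMD,
        dag=dag,
    )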

EDIT-1

There seems to be another straightforward way:

  1. Specifying the remote master IP

    • Independent of the remote system
    • Needs modifying global configurations / environment variables
    • See @cricket_007's answer for details, and the sketch after this list
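
A minimal sketch of that approach via a BashOperator, assuming the EMR master's Hadoop/YARN client configs (core-site.xml, yarn-site.xml) have been copied to /etc/hadoop/conf on the Airflow host; the path, class and JAR are placeholder assumptions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('remote_yarn_spark_submit', start_date=datetime(2019, 1, 1),
              schedule_interval=None)

    # HADOOP_CONF_DIR must point at copies of the EMR master's client
    # configs so spark-submit can locate the remote YARN ResourceManager
    submit_remote = BashOperator(
        task_id='spark_submit_remote_yarn',
        bash_command=(
            'HADOOP_CONF_DIR=/etc/hadoop/conf '  # hypothetical path
            'spark-submit --master yarn --deploy-mode cluster '
            '--class com.example.MySparkJob '
            's3://my-bucket/jars/my-spark-job.jar'
        ),
        dag=dag,
    )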

Useful links

  • This one is from @Kaxil Naik himself: Is there a way to submit spark job on different server running master
  • Spark job submission using Airflow by submitting batch POST method on Livy and tracking job
  • Remote spark-submit to YARN running on EMR



Answer 2:


As you have created the EMR cluster using Terraform, you can get the master's address from the aws_emr_cluster.my-emr.master_public_dns attribute (note this is the public DNS name, not an IP).
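
As a sketch, assuming you expose that attribute through a Terraform output named emr_master_dns (a hypothetical name), you can read it from Python and feed it into, say, the Livy endpoint or an SSH connection:

    import json
    import subprocess

    # Assumes the Terraform config declares:
    #   output "emr_master_dns" { value = aws_emr_cluster.my-emr.master_public_dns }
    out = subprocess.check_output(['terraform', 'output', '-json'])
    master_dns = json.loads(out)['emr_master_dns']['value']

    # e.g. build the Livy batches endpoint from it
    print('http://%s:8998/batches' % master_dns)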

Hope this helps.



Source: https://stackoverflow.com/questions/54022129/how-to-submit-spark-jobs-to-emr-cluster-from-airflow
