问题
How can I establish a connection between EMR master cluster(created by Terraform) and Airflow. I have Airflow setup under AWS EC2 server with same SG,VPC and Subnet.
I need solutions so that Airflow can talk to EMR and execute Spark submit.
https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/
These blogs have understanding on execution after connection has been established.(Didn\'t help much)
In airflow I have made a connection using UI for AWS and EMR:-
Below is the code which will list the EMR cluster\'s which are Active and Terminated, I can also fine tune to get Active Clusters:-
from airflow.contrib.hooks.aws_hook import AwsHook
import boto3
hook = AwsHook(aws_conn_id=‘aws_default’)
client = hook.get_client_type(‘emr’, ‘eu-central-1’)
for x in a:
print(x[‘Status’][‘State’],x[‘Name’])
My question is - How can I update my above code can do Spark-submit actions
回答1:
While it may not directly address your particular query, broadly, here are some ways you can trigger spark-submit
on (remote) EMR
via Airflow
Use
Apache Livy
- This solution is actually independent of remote server, i.e.,
EMR
- Here's an example
- The downside is that
Livy
is in early stages and itsAPI
appears incomplete and wonky to me
- This solution is actually independent of remote server, i.e.,
Use
EmrSteps
API
- Dependent on remote system:
EMR
- Robust, but since it is inherently async, you will also need an
EmrStepSensor
(alongsideEmrAddStepsOperator
) - On a single
EMR
cluster, you cannot have more than one steps running simultaneously (although some hacky workarounds exist)
- Dependent on remote system:
Use
SSHHook
/SSHOperator
- Again independent of remote system
- Comparatively easier to get started with
- If your
spark-submit
command involves a lot of arguments, building that command (programmatically) can become cumbersome
EDIT-1
There seems to be another straightforward way
Specifying remote
master
-IP- Independent of remote system
- Needs modifying Global Configurations / Environment Variables
- See @cricket_007's answer for details
Useful links
- This one is from @Kaxil Naik himself: Is there a way to submit spark job on different server running master
- Spark job submission using Airflow by submitting batch POST method on Livy and tracking job
- Remote spark-submit to YARN running on EMR
回答2:
As you have created EMR using Terraform, then you get the master IP as aws_emr_cluster.my-emr.master_public_dns
Hope this helps.
来源:https://stackoverflow.com/questions/54022129/how-to-submit-spark-jobs-to-emr-cluster-from-airflow