How to run spark-submit programs on a different cluster (1**.1*.0.21) from Airflow (1**.1*.0.35): how to connect remotely to the other cluster from Airflow

Submitted by 亡梦爱人 on 2021-01-29 20:34:26

Question


I have been trying to spark-submit programs from Airflow, but the Spark files are on a different cluster (1**.1*.0.21) while Airflow is on (1**.1*.0.35). I am looking for a detailed explanation of this topic with examples. I can't copy or download any XML files or other files to my Airflow cluster.

When I try the SSH hook, it says the following. I also have many doubts about using the SSHOperator and BashOperator.

Broken DAG: [/opt/airflow/dags/s.py] No module named paramiko

Answer 1:


You can try using Livy. In the following Python example, my executable jars are on S3.

import json
import requests

def spark_submit(master_dns):
    # Livy's REST API listens on port 8998 on the master node.
    host = 'http://' + master_dns + ':8998'
    data = {"conf": {"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"},
            "file": "s3://<your driver jar>",
            "jars": ["s3://<dependency>.jar"]}
    headers = {'Content-Type': 'application/json'}
    print("Calling request........")
    # POST /batches submits the driver jar as a Livy batch job.
    response = requests.post(host + '/batches', data=json.dumps(data), headers=headers)
    print(response.json())
    return response.headers

I am running the above code wrapped in a PythonOperator from Airflow.
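A minimal sketch of how that wrapper could look, assuming Airflow 1.x import paths; the DAG id, schedule, and the <emr-master-dns> host below are placeholders, and spark_submit is the function defined above:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical DAG that calls spark_submit() against the remote Livy endpoint.
dag = DAG(dag_id="livy_spark_submit",
          start_date=datetime(2021, 1, 1),
          schedule_interval=None)

submit_spark_job = PythonOperator(
    task_id="spark_submit_via_livy",
    python_callable=spark_submit,                   # the function shown above
    op_kwargs={"master_dns": "<emr-master-dns>"},   # placeholder master hostname
    dag=dag,
)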




Answer 2:


paramiko is a Python library for performing SSH operations. You have to install paramiko to use the SSHOperator. Simply install it with the command: pip3 install paramiko.
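As a quick sanity check that paramiko is installed and can reach the remote cluster, a minimal sketch (the hostname, username, and password below are placeholders):

import paramiko

# Open a throwaway SSH connection and run a trivial command on the remote host.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("<remote-cluster-host>", username="<user>", password="<password>")
stdin, stdout, stderr = client.exec_command("hostname")
print(stdout.read().decode())
client.close()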

Let me know if you have any problems after installing paramiko.




Answer 3:


I got the connection working; here is my code and procedure.

import airflow
from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


dag = DAG(dag_id="spk", description='filer',
          schedule_interval='* * * * *',
          start_date=airflow.utils.dates.days_ago(2),
          params={'project_source': '/home/afzal',
                  'spark_submit': '/usr/hdp/current/spark2-client/bin/spark-submit --principal hdfs-ivory@KDCAUTH.COM --keytab /etc/security/keytabs/hdfs.headless.keytab --master yarn --deploy-mode client airpy.py'})

templated_bash_command = """
            cd {{ params.project_source }}
            {{ params.spark_submit }} 
            """

t1 = SSHOperator(
       task_id="SSH_task",
       ssh_conn_id='spark_21',
       command=templated_bash_command,
       dag=dag
       )

I also created a connection under 'Admin > Connections' in Airflow:

Conn Id : spark_21
Conn Type : SSH
Host : mas****p
Username : afzal
Password : ***** 
Port  :
Extra  :

The username and password are used to log in to the desired cluster.
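If you would rather not click through the UI, the same connection can be registered programmatically in Airflow's metadata database; a minimal sketch, where the host and password values are placeholders for the masked ones above:

from airflow import settings
from airflow.models import Connection

# Hypothetical one-off script: create the 'spark_21' SSH connection in the metadata DB.
conn = Connection(conn_id="spark_21",
                  conn_type="ssh",
                  host="<remote-cluster-host>",
                  login="afzal",
                  password="<password>")
session = settings.Session()
session.add(conn)
session.commit()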



Source: https://stackoverflow.com/questions/59552586/to-run-spark-submit-programs-from-a-different-cluster-1-1-0-21-in-airflow
