Airflow stops following Spark job submitted over SSH


Question


I'm using Apache Airflow standalone to submit my Spark jobs: an SSHExecuteOperator connects to the edge node and submits the job with a simple bash command.
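For context, the setup looks roughly like this (a minimal sketch using the Airflow 1.8 contrib API; the connection id, DAG name, and spark-submit command are illustrative placeholders, not my actual values):

# Minimal sketch of the Airflow 1.8 setup described above.
# "ssh_edge_node", the DAG name, and the spark-submit command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_execute_operator import SSHExecuteOperator

dag = DAG(
    dag_id="spark_over_ssh",
    start_date=datetime(2018, 7, 1),
    schedule_interval=None,
)

# SSH connection (defined in the Airflow UI) pointing at the Hadoop edge node.
ssh_hook = SSHHook(conn_id="ssh_edge_node")

submit_spark_job = SSHExecuteOperator(
    task_id="submit_spark_job",
    bash_command="spark-submit --master yarn --deploy-mode cluster /path/to/job.jar",
    ssh_hook=ssh_hook,
    dag=dag,
)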

It mostly works well, but sometimes random tasks keep running indefinitely.

My job succeeds, but is still running according to Airflow.

When I check the logs, it's like Airflow has stopped following the job as if it didn't get the return value.

Why could this happen? Some jobs run for 10h+ and Airflow watches them successfully, while others fail.

I have only Spark's logs (at INFO level) without anything printed by the job driver.

It doesn't seem to depend on the deploy mode: I have used both client and cluster and the behaviour is the same. Sometimes Airflow even fails to watch a simple Python script.

To solve this issue, I was wondering if installing this plugin could work.

EDIT :

I'm using Airflow 1.8.

I didn't install SparkSubmitOperator because: "The executors need to have access to the spark-submit command on the local command-line shell. Spark libraries will need to be installed."

My Airflow instance is just a VM with no Hadoop binaries. Airflow opens an SSH connection and then submits on the edge node.

When I look at the SparkSubmitOperator documentation, I don't think I can connect to the edge node to submit: there is no "conn_id" or SSH parameter.

PS2: This morning, a job had been running all night long (even though it is supposed to finish in about 30 minutes). I used netstat to check whether my Airflow application user still had an SSH connection open, and there was nothing; the SSH connection had died, in my opinion.
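A rough way to script that kind of check (this is only a sketch, assuming a Linux host with net-tools installed; it is not the exact command I ran):

# List established TCP connections to port 22 as seen from the Airflow host.
import subprocess

output = subprocess.check_output(["netstat", "-tn"], universal_newlines=True)
ssh_connections = [
    line for line in output.splitlines()
    if "ESTABLISHED" in line and ":22 " in line
]
print("\n".join(ssh_connections) or "no established SSH connections")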

Same task, same DAG, different runs:

OK:

[2018-07-05 10:48:55,509] {base_task_runner.py:95} INFO - Subtask: [2018-07-05 10:48:55,509] {ssh_execute_operator.py:146} INFO - 18/07/05 10:48:55 INFO datasources.FileFormatWriter: Job null committed.
[2018-07-05 10:48:55,510] {base_task_runner.py:95} INFO - Subtask: [2018-07-05 10:48:55,510] {ssh_execute_operator.py:146} INFO - 18/07/05 10:48:55 INFO datasources.FileFormatWriter: Finished processing stats for job null.
[2018-07-05 10:49:08,407] {jobs.py:2083} INFO - Task exited with return code 0

FAIL:

[2018-07-04 18:52:12,849] {base_task_runner.py:95} INFO - Subtask: [2018-07-04 18:52:12,849] {ssh_execute_operator.py:146} INFO - 18/07/04 18:52:12 INFO scheduler.DAGScheduler: Job 5 finished: json at CleaningTweets.scala:249, took 8.411721 s
[2018-07-04 18:52:13,530] {base_task_runner.py:95} INFO - Subtask: [2018-07-04 18:52:13,530] {ssh_execute_operator.py:146} INFO - 18/07/04 18:52:13 INFO datasources.FileFormatWriter: Job null committed.
[2018-07-04 18:52:13,531] {base_task_runner.py:95} INFO - Subtask: [2018-07-04 18:52:13,530] {ssh_execute_operator.py:146} INFO - 18/07/04 18:52:13 INFO datasources.FileFormatWriter: Finished processing stats for job null.

The "Task exited with return code" line is missing...

LAST EDIT: I removed all the logging (print/show) from every job, and it seems to be working.
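In other words, keeping the driver's console output quiet seems to avoid the problem. A rough illustration of that idea (a PySpark sketch with placeholder paths; my actual jobs are Scala, e.g. CleaningTweets.scala):

# Keep the driver quiet so the SSH channel Airflow tails stays mostly silent.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cleaning_tweets").getOrCreate()

# Reduce Spark's own logging from INFO to WARN for this job.
spark.sparkContext.setLogLevel("WARN")

df = spark.read.json("hdfs:///path/to/input")  # placeholder path
# ... transformations ...
# Avoid df.show() / print(...) in production runs; write results instead.
df.write.mode("overwrite").parquet("hdfs:///path/to/output")  # placeholder path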

Source: https://stackoverflow.com/questions/51177802/airflow-stops-following-spark-job-submitted-over-ssh
