Question
I'm using Apache Airflow standalone to submit my Spark jobs: the SSHExecuteOperator connects to the edge node and submits each job with a simple bash command.
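For reference, a minimal sketch of that setup (the connection id "edge_node_ssh", the jar path and the main class are hypothetical, and the exact hook/operator arguments vary between Airflow versions):

from datetime import datetime

from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_execute_operator import SSHExecuteOperator

# Hypothetical Airflow connection pointing at the edge node
ssh_hook = SSHHook(conn_id="edge_node_ssh")

dag = DAG("spark_over_ssh_example", start_date=datetime(2018, 7, 1),
          schedule_interval=None)

# Runs spark-submit on the edge node over SSH; Airflow itself has no Spark binaries.
submit_job = SSHExecuteOperator(
    task_id="submit_cleaning_tweets",
    ssh_hook=ssh_hook,
    bash_command=(
        "spark-submit --master yarn --deploy-mode cluster "
        "--class com.example.CleaningTweets /path/to/cleaning_tweets.jar"  # hypothetical paths
    ),
    dag=dag,
)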
It mostly works well, but sometimes random tasks keep running indefinitely.
My job succeeds, but is still running according to Airflow.
When I check the logs, it looks like Airflow stopped following the job, as if it never got the return value.
Why could this happen? Some jobs run for 10+ hours and Airflow follows them successfully, while for others it loses track.
I only have Spark's logs (at INFO level), with nothing printed by the job driver.
It doesn't seem to depend on the deploy mode: I have used both client and cluster and the behaviour is the same. Sometimes Airflow even fails to follow a simple Python script.
To solve this issue, I was wondering if installing this plugin could work.
EDIT: I'm using Airflow 1.8.
I didn't install SparkSubmitOperator because: "The executors need to have access to the spark-submit command on the local commandline shell. Spark libraries will need to be installed."
My Airflow instance is just a VM with no Hadoop binaries. Airflow opens an SSH connection and then submits on the edge node.
When I look at the SparkSubmitOperator documentation, I don't see how I could connect to the edge node to submit: there is no SSH parameter that would make the submit run on a remote host.
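For context, this is roughly how SparkSubmitOperator is wired up when spark-submit is available locally (hypothetical DAG name and application path, assuming the contrib operator shipped with this Airflow version); it runs spark-submit on the Airflow worker itself, which is why it doesn't fit my SSH-to-edge-node setup:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG("spark_submit_local_example", start_date=datetime(2018, 7, 1),
          schedule_interval=None)

# Invokes spark-submit on the machine running the Airflow worker,
# so Spark binaries and libraries must be installed there.
submit_locally = SparkSubmitOperator(
    task_id="submit_locally",
    application="/path/to/cleaning_tweets.jar",  # hypothetical artifact
    dag=dag,
)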
PS2: This morning a job had been running all night long (even though it is supposed to finish in about 30 minutes...). I used netstat to check whether my Airflow application user still had an SSH connection open to the edge node, and... nothing; the SSH connection had died, in my opinion.
Same task, same DAG, different runs:
OK:
[2018-07-05 10:48:55,509] {base_task_runner.py:95} INFO - Subtask: [2018-07-05 10:48:55,509] {ssh_execute_operator.py:146} INFO - 18/07/05 10:48:55 INFO datasources.FileFormatWriter: Job null committed.
[2018-07-05 10:48:55,510] {base_task_runner.py:95} INFO - Subtask: [2018-07-05 10:48:55,510] {ssh_execute_operator.py:146} INFO - 18/07/05 10:48:55 INFO datasources.FileFormatWriter: Finished processing stats for job null.
[2018-07-05 10:49:08,407] {jobs.py:2083} INFO - Task exited with return code 0
FAIL:
[2018-07-04 18:52:12,849] {base_task_runner.py:95} INFO - Subtask: [2018-07-04 18:52:12,849] {ssh_execute_operator.py:146} INFO - 18/07/04 18:52:12 INFO scheduler.DAGScheduler: Job 5 finished: json at CleaningTweets.scala:249, took 8.411721 s
[2018-07-04 18:52:13,530] {base_task_runner.py:95} INFO - Subtask: [2018-07-04 18:52:13,530] {ssh_execute_operator.py:146} INFO - 18/07/04 18:52:13 INFO datasources.FileFormatWriter: Job null committed.
[2018-07-04 18:52:13,531] {base_task_runner.py:95} INFO - Subtask: [2018-07-04 18:52:13,530] {ssh_execute_operator.py:146} INFO - 18/07/04 18:52:13 INFO datasources.FileFormatWriter: Finished processing stats for job null.
The "Task exited with return code" line never shows up...
LAST EDIT: I removed all the logging (print/show) from every job, and it seems to be working now.
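As a follow-up on that workaround: since the SSHExecuteOperator streams everything the remote command prints back over the SSH channel, cutting down the output has a similar effect to deleting the print/show calls. A hedged sketch of a quieter command that could be passed as bash_command (paths and log file are hypothetical):

# Hypothetical, quieter variant of the bash_command given to SSHExecuteOperator.
# Sending spark-submit's output to a file on the edge node keeps the SSH channel
# (and the Airflow task log) from being flooded by driver/job output.
quiet_submit_cmd = (
    "spark-submit --master yarn --deploy-mode cluster "
    "--class com.example.CleaningTweets /path/to/cleaning_tweets.jar "
    "> /tmp/cleaning_tweets_submit.log 2>&1"
)

The trade-off is that the Spark output then has to be read on the edge node or through YARN instead of in the Airflow task log.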
Source: https://stackoverflow.com/questions/51177802/airflow-stops-following-spark-job-submitted-over-ssh