Apache Airflow 1.10.3: Executor reports task instance ??? finished (failed) although the task says its queued. Was the task killed externally?

Question


An Airflow ETL DAG hits this error every day.

Our Airflow installation uses the CeleryExecutor. The relevant concurrency settings are:

# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32

# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16

# Are DAGs paused by default at creation
dags_are_paused_at_creation = True

# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128

# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16

[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above

# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor

# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
celeryd_concurrency = 16

We have a DAG that runs daily. It has around 21 parallel tasks, each following the same pattern: sense whether the data exists in HDFS, sleep for 10 minutes, and finally upload the data to S3.
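For reference, here is a hypothetical sketch of what one of those parallel chains could look like in Airflow 1.10; the task names, HDFS path, schedule and upload command are illustrative assumptions, not our actual DAG code:

# Hypothetical sketch of one of the parallel chains in wh_hdfs_to_s3
# (task names, HDFS path, schedule and upload command are illustrative only).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.hdfs_sensor import HdfsSensor

default_args = {
    "owner": "airflow",
    "retries": 3,                        # failed tasks go to up_for_retry
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="wh_hdfs_to_s3",
    default_args=default_args,
    start_date=datetime(2019, 5, 1),
    schedule_interval="0 4 * * *",       # daily, matching the 04:00 UTC execution_date in the logs below
    max_active_runs=1,
)

check_hdfs_data = HdfsSensor(
    task_id="check_hdfs_data_dct_order_item_15",
    filepath="/warehouse/dct_order_item",          # assumed HDFS path
    hdfs_conn_id="hdfs_default",
    poke_interval=60,
    timeout=60 * 60,
    dag=dag,
)

sleep_10_min = BashOperator(
    task_id="sleep_10_minutes",
    bash_command="sleep 600",
    dag=dag,
)

upload_to_s3 = BashOperator(
    task_id="upload_dct_order_item_to_s3",
    bash_command="hadoop distcp /warehouse/dct_order_item s3a://our-bucket/dct_order_item",  # assumed upload command
    dag=dag,
)

check_hdfs_data >> sleep_10_min >> upload_to_s3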

Some of the tasks have been encountering the following error:

2019-05-12 00:00:46,209 INFO - Executor reports wh_hdfs_to_s3.check_hdfs_data_dct_order_item_15 execution_date=2019-05-11 04:00:00+00:00 as failed for try_number 1
2019-05-12 00:00:46,212 ERROR - Executor reports task instance <TaskInstance: wh_hdfs_to_s3.check_hdfs_data_dct_order_item_15 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
2019-05-12 00:00:46,212 INFO - Filling up the DagBag from /opt/DataLoader/airflow/dags/wh_hdfs_to_s3.py
2019-05-12 00:00:46,425 INFO - Using connection to: id: wh_aws_mysql. Host: db1.prod.coex.us-east-1.aws.owneriq.net, Port: None, Schema: WAREHOUSE_MOST, Login: whuser, Password: XXXXXXXX, extra: {}
2019-05-12 00:00:46,557 ERROR - Executor reports task instance <TaskInstance: wh_hdfs_to_s3.check_hdfs_data_dct_order_item_15 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
None
2019-05-12 00:00:46,558 INFO - Marking task as UP_FOR_RETRY
2019-05-12 00:00:46,561 WARNING - section/key [smtp/smtp_user] not found in config
2019-05-12 00:00:46,640 INFO - Sent an alert email to [u'wh-report-admin@owneriq.com']
2019-05-12 00:00:46,679 INFO - Executor reports wh_hdfs_to_s3.check_hdfs_data_tbldimmostlineitem_105 execution_date=2019-05-11 04:00:00+00:00 as failed for try_number 1
2019-05-12 00:00:46,682 ERROR - Executor reports task instance <TaskInstance: wh_hdfs_to_s3.check_hdfs_data_tbldimmostlineitem_105 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
2019-05-12 00:00:46,682 INFO - Filling up the DagBag from /opt/DataLoader/airflow/dags/wh_hdfs_to_s3.py
2019-05-12 00:00:46,686 INFO - Using connection to: id: wh_aws_mysql. Host: db1.prod.coex.us-east-1.aws.owneriq.net, Port: None, Schema: WAREHOUSE_MOST, Login: whuser, Password: XXXXXXXX, extra: {}
2019-05-12 00:00:46,822 ERROR - Executor reports task instance <TaskInstance: wh_hdfs_to_s3.check_hdfs_data_tbldimmostlineitem_105 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
None
2019-05-12 00:00:46,822 INFO - Marking task as UP_FOR_RETRY
2019-05-12 00:00:46,826 WARNING - section/key [smtp/smtp_user] not found in config
2019-05-12 00:00:46,902 INFO - Sent an alert email to [u'wh-report-admin@owneriq.com']
2019-05-12 00:00:46,918 INFO - Executor reports wh_hdfs_to_s3.check_hdfs_data_tbldimdatasourcetag_135 execution_date=2019-05-11 04:00:00+00:00 as success for try_number 1
2019-05-12 00:00:46,921 INFO - Executor reports wh_hdfs_to_s3.check_hdfs_data_flight_69 execution_date=2019-05-11 04:00:00+00:00 as success for try_number 1
2019-05-12 00:00:46,923 INFO - Executor reports wh_hdfs_to_s3.check_hdfs_data_tbldimariamode_93 execution_date=2019-05-11 04:00:00+00:00 as success for try_number 1

This kind of error occurs randomly across those tasks. When it happens, the task instance's state is immediately set to up_for_retry and there are no logs on the worker nodes. After some retries, the tasks eventually execute and finish.

This problem sometimes causes large ETL delays. Does anyone know how to solve it?


Answer 1:


I was seeing very similar symptoms in my DagRuns. I initially thought it was due to the ExternalTaskSensor and concurrency issues, given the queuing and killed-task language in errors like this one:

Executor reports task instance <TaskInstance: dag1.data_table_temp_redshift_load 2019-05-20 08:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?

But when I looked at the worker logs, I saw the error was actually caused by setting a variable with Variable.set in my DAG file. The issue is described in "duplicate key value violates unique constraint when adding path variable in airflow dag": the scheduler polls the DagBag at regular intervals to pick up changes dynamically, so module-level code runs on every refresh, and hitting that error on every heartbeat was causing significant ETL delays.
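For illustration, here is a minimal sketch of that anti-pattern and one way to avoid it; the variable name and path are made up, not taken from either DAG:

# Illustrative sketch only – the variable name and value are hypothetical.
from airflow.models import Variable

# Problematic pattern: a module-level Variable.set() runs on every scheduler
# parse of the DAG file (every heartbeat), hammering the metadata DB and
# occasionally failing with "duplicate key value violates unique constraint":
#
#     Variable.set("data_path", "/opt/DataLoader/staging")
#
# Safer: only read (with a default) at parse time ...
data_path = Variable.get("data_path", default_var="/opt/DataLoader/staging")

# ... and defer any writes to a task callable (e.g. a PythonOperator's
# python_callable), so they run only when the task actually executes.
def persist_data_path(**context):
    Variable.set("data_path", "/opt/DataLoader/staging")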

Are you performing any logic in your wh_hdfs_to_s3 DAG (or others) that might be causing errors or delays and, in turn, these symptoms?




Answer 2:


We were facing similar problems, which were resolved by passing the "-x, --donot_pickle" option to airflow backfill, so the DAG object is not pickled and shipped to the workers; the workers run their own copy of the DAG code instead.

For more information: https://airflow.apache.org/cli.html#backfill




Answer 3:


We fixed this already. Let me answer my own question:

We have 5 Airflow worker nodes. After installing Flower to monitor how tasks are distributed across these nodes, we found that the failing tasks were always sent to one specific node. We tried using the airflow test command to run the same tasks on other nodes, and they worked there. Eventually, the root cause was a wrong Python package on that specific node.
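For anyone hitting the same thing, a quick way to spot that kind of drift is to diff the installed packages on every worker against a reference node. A minimal sketch, assuming passwordless SSH to each worker (the host names below are made up):

# Hypothetical helper: compare installed Python packages across worker nodes.
# Host names are made up; assumes passwordless SSH from the scheduler box.
import subprocess

WORKERS = ["airflow-worker1", "airflow-worker2", "airflow-worker3",
           "airflow-worker4", "airflow-worker5"]

def frozen_packages(host):
    """Return the worker's `pip freeze` output as a set of requirement strings."""
    out = subprocess.check_output(["ssh", host, "pip", "freeze"])
    return set(out.decode().splitlines())

reference = frozen_packages(WORKERS[0])
for host in WORKERS[1:]:
    diff = reference ^ frozen_packages(host)
    if diff:
        print("%s differs from %s: %s" % (host, WORKERS[0], sorted(diff)))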



Source: https://stackoverflow.com/questions/56119107/apache-airflow-1-10-3-executor-reports-task-instance-finished-failed-alth
