Dynamically create list of tasks

Submitted by 孤者浪人 on 2019-12-11 01:14:16

Question


I have a DAG which is created by querying DynamoDB for a list; for each item in the list, a task is created using a PythonOperator and added to the DAG. It's not shown in the example below, but it's important to note that some of the items in the list depend on other tasks, so I'm using set_upstream to enforce the dependencies (a rough sketch of that wiring follows the code below).

- airflow_home
  \- dags
    \- workflow.py

workflow.py

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def get_task_list():
    # ... query dynamodb ...

def run_task(task):
    # ... do stuff ...

dag = DAG(dag_id='my_dag', ...)
tasks = get_task_list()
for task in tasks:
    t = PythonOperator(
        task_id=task['id'],
        provide_context=False,
        dag=dag,
        python_callable=run_task,
        op_args=[task]
    )
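
The dependency wiring looks roughly like this: store each operator as it's created, then link parents afterwards (the parent_id field here is a hypothetical stand-in for however the items actually reference each other, not my exact code):

operators = {}
for task in tasks:
    operators[task['id']] = PythonOperator(
        task_id=task['id'],
        provide_context=False,
        dag=dag,
        python_callable=run_task,
        op_args=[task]
    )

for task in tasks:
    parent_id = task.get('parent_id')  # hypothetical field
    if parent_id:
        # ensure the parent item's task runs before this one
        operators[task['id']].set_upstream(operators[parent_id])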

The problem is that workflow.py is being run over and over (every time a task runs?), and my get_task_list() method is being throttled by AWS and throwing exceptions.

I thought it was because whenever run_task() was called it was evaluating all the globals in workflow.py, so I tried moving run_task() into a separate module like this:

- airflow_home
  \- dags
    \- workflow.py
    \- mypackage
      \- __init__.py
      \- task.py
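
so that workflow.py now imports the callable instead of defining it (a sketch of the import):

from mypackage.task import run_task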

But it didn't change anything. I even tried putting get_task_list() into a SubDagOperator wrapped with a factory function, and it still behaves the same way.

Is my problem related to these issues?

  • Tasks added to DAG during runtime fail to be scheduled
  • How to nest an Airflow DAG dynamically?

Also, why is workflow.py being run so often, and why would an error thrown by get_task_list() cause an individual task to fail when the task method doesn't reference workflow.py and has no dependencies on it?

Most importantly, what would be the best way to both process the list in parallel and enforce any dependencies between items in the list?


Answer 1:


As per the questions you referenced, Airflow doesn't support creating tasks while a DAG is running.

What happens instead is that Airflow periodically re-parses the DAG file and regenerates the complete DAG definition before it starts a run. Ideally, the period of that regeneration would match the schedule interval of the DAG.

BUT it might be that every time Airflow checks the DAG file for changes, it is also regenerating the complete DAG, causing too many requests to DynamoDB. That frequency is controlled with the min_file_process_interval and dag_dir_list_interval settings in airflow.cfg.
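
For example, raising both intervals in airflow.cfg makes the file be parsed less often (the values below are illustrative, not recommendations):

[scheduler]
# minimum seconds between re-parses of the same DAG file
min_file_process_interval = 300
# seconds between scans of the dags folder for new files
dag_dir_list_interval = 300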

Regarding the task failures: they fail because the DAG creation itself failed, so Airflow was never able to start them.
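
To stop get_task_list() from hitting DynamoDB on every parse, one option is a simple file cache (a sketch, not something spelled out above; the path and TTL are arbitrary):

import json
import os
import time

CACHE_PATH = '/tmp/task_list_cache.json'  # hypothetical location
CACHE_TTL = 300  # seconds; align with your parse interval

def get_task_list_cached():
    # serve the cached list while it is still fresh
    if os.path.exists(CACHE_PATH):
        age = time.time() - os.path.getmtime(CACHE_PATH)
        if age < CACHE_TTL:
            with open(CACHE_PATH) as f:
                return json.load(f)
    # otherwise hit DynamoDB once and refresh the cache
    tasks = get_task_list()
    with open(CACHE_PATH, 'w') as f:
        json.dump(tasks, f)
    return tasks

This keeps the DAG definition dynamic while bounding the number of DynamoDB calls per parse interval.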



Source: https://stackoverflow.com/questions/45119993/dynamically-create-list-of-tasks
