How to really create n tasks in a SubDAG based on the result of a previous task

Submitted by 断了今生、忘了曾经 on 2019-12-05 11:35:43

I had the same issue, and I couldn't solve the problem 100% in an "Airflow way": the number of Airflow tasks and subtasks is fixed at the moment of DAG validation (parsing), and since no task has run yet at validation time, Airflow has no way of knowing beforehand how many subdag tasks will be scheduled.

The way I circumvented this issue might not be the best (I'm open to suggestions), but it works:

main_dag.py

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

from subdag import imported_subdag  # the subdag factory defined in subdag.py below
def get_info_from_db():
    # get info from db or somewhere else, this info will define the number of subdag tasks to run
    return urls, names

# args and DAG_NAME are assumed to be defined elsewhere in this file
dag = DAG(...)

urls, names = get_info_from_db()

# You may ignore the dummy operators
start = DummyOperator(task_id='start', default_args=args, dag=dag)
sub_section = SubDagOperator(
    task_id='import-file',
    # the subdag's dag_id must be '<parent_dag_id>.<task_id>' of this operator
    subdag=imported_subdag(DAG_NAME, 'import-file', args, urls=urls, file_names=names),
    default_args=args,
    dag=dag,
)
end = DummyOperator(task_id='end', default_args=args, dag=dag)

start.set_downstream(sub_section)
sub_section.set_downstream(end)
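
For reference, get_info_from_db() can be anything that returns the two lists at parse time. A minimal sketch, assuming a hypothetical SQLite database with a files table (the path, table, and column names are all illustrative):

import sqlite3

def get_info_from_db():
    # this runs on every DAG parse, so keep the query cheap
    conn = sqlite3.connect('/path/to/metadata.db')  # hypothetical path
    rows = conn.execute('SELECT url, name FROM files').fetchall()
    conn.close()
    urls = [row[0] for row in rows]
    names = [row[1] for row in rows]
    return urls, names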

Then, finally, I have my subdag.py (make sure it is discoverable by Airflow in case it lives in a separate file):

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
def fetch_files(file_url, file_name):
    # get file and save it to disk
    return file_location

# this is how I get the info returned by the previous task: fetch_files
def validate_file(task_id, **kwargs):
    ti = kwargs['ti']
    task = 'fetch_file-{}'.format(task_id)
    file_location = ti.xcom_pull(task_ids=task)
    # validate the file found at file_location ...

def imported_subdag(parent_dag_name, child_dag_name, args, urls, file_names):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval="@daily",
    )
    for i in range(len(urls)):
        # the task ids must also be dynamic in order not to have duplicates
        validate_file_operator = PythonOperator(
            task_id='validate_file-{}'.format(i + 1),
            python_callable=validate_file,
            provide_context=True,
            op_kwargs={'task_id': i + 1},
            dag=dag_subdag,
        )
        fetch_operator = PythonOperator(
            task_id='fetch_file-{}'.format(i + 1),
            python_callable=fetch_files,
            op_kwargs={'file_url': urls[i], 'file_name': file_names[i]},
            dag=dag_subdag,
        )
        fetch_operator.set_downstream(validate_file_operator)
    return dag_subdag
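
A small design note: instead of passing the numeric suffix into validate_file and rebuilding the upstream task id there, you could pass the upstream task id directly, which keeps the naming convention in one place. A sketch of that variant (same operators as above, only the kwargs change):

def validate_file(fetch_task_id, **kwargs):
    ti = kwargs['ti']
    file_location = ti.xcom_pull(task_ids=fetch_task_id)
    # validate the file found at file_location ...

# ...and in the loop:
#     op_kwargs={'fetch_task_id': 'fetch_file-{}'.format(i + 1)}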

Basically, my logic is that at the moment of validation by Airflow, get_info_from_db() gets executed, and all DAGs and subdags are scheduled dynamically with the right number of tasks. If I add or remove content from the database, the number of tasks to run is updated at the next DAG validation.
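
One caveat worth noting: the scheduler re-parses DAG files continuously, so get_info_from_db() runs far more often than the DAG itself. If the query is expensive, a simple file cache softens that load. A minimal sketch, with a hypothetical cache path and an arbitrary five-minute TTL:

import json
import os
import time

CACHE_PATH = '/tmp/file_info_cache.json'  # hypothetical location

def get_info_from_db_cached(ttl=300):
    # reuse the cached result while it is younger than ttl seconds
    if os.path.exists(CACHE_PATH) and time.time() - os.path.getmtime(CACHE_PATH) < ttl:
        with open(CACHE_PATH) as f:
            cached = json.load(f)
        return cached['urls'], cached['names']
    urls, names = get_info_from_db()
    with open(CACHE_PATH, 'w') as f:
        json.dump({'urls': urls, 'names': names}, f)
    return urls, names

Swapping get_info_from_db() for the cached variant in main_dag.py is enough to apply it.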

This approach suited my use case, but I hope in the future Airflow supports this feature (a dynamic number of tasks/subdag tasks) natively.
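
Update: newer Airflow releases (2.3+) added dynamic task mapping, which covers exactly this use case natively. A minimal sketch using the TaskFlow API (the DAG and task names here are illustrative):

import pendulum
from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=pendulum.datetime(2022, 1, 1), catchup=False)
def dynamic_fetch():
    @task
    def get_urls():
        # this list can come from a database query at run time, not parse time
        return ['http://example.com/a', 'http://example.com/b']

    @task
    def fetch_file(url):
        print('fetching', url)

    # one fetch_file task instance is mapped per element of the returned list
    fetch_file.expand(url=get_urls())

dynamic_fetch()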
