Guarantee that some operators will be executed on the same airflow worker

Submitted by 痞子三分冷 on 2021-01-28 07:33:15

Question


I have a DAG which

  1. downloads a CSV file from cloud storage
  2. uploads the CSV file to a third party via HTTPS

The Airflow cluster I am executing on uses the CeleryExecutor by default, so I'm worried that at some point, when I scale up the number of workers, these tasks may be executed on different workers. E.g., worker A does the download, worker B tries to upload but doesn't find the file (because it's on worker A).

Is it possible to somehow guarantee that both the download and upload operators will be executed on the same Airflow worker?


Answer 1:


For these kinds of use cases we have two solutions:

  1. Use a network-mounted drive that is shared between the workers, so that both the downloading and uploading tasks have access to the same file system.
  2. Use a worker-specific Airflow queue. If only one worker listens on that queue, you guarantee that both tasks run on the same machine and therefore see the same file system. Note that each worker can listen on multiple queues, so it can serve the "default" queue as well as the custom one intended for these tasks (see the sketch after this list).
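Here is a minimal sketch of the queue-based approach. The queue name "csv_queue", the DAG id, and the stubbed-out operators are illustrative assumptions, not from the original question:

from datetime import datetime
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('csv_transfer', schedule_interval=None, start_date=datetime(2017, 1, 1))

# Pinning both tasks to the same custom Celery queue means only workers
# subscribed to that queue will pick them up.
task_download = DummyOperator(task_id='task_download_csv', queue='csv_queue', dag=dag)
task_upload = DummyOperator(task_id='task_upload_csv', queue='csv_queue', dag=dag)

task_download >> task_upload

Then start exactly one worker that subscribes to that queue (it can also keep serving the default queue):

airflow worker -q default,csv_queue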



Answer 2:


Put step 1 (the CSV download) and step 2 (the CSV upload) into a SubDAG, and then trigger it via the SubDagOperator with its executor option set to a SequentialExecutor - this ensures that steps 1 and 2 run on the same worker.

Here is a working DAG file illustrating that concept (with the actual operations stubbed out as DummyOperators), with the download/upload steps in the context of some larger process:

from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.executors.sequential_executor import SequentialExecutor

PARENT_DAG_NAME = 'subdaggy'
CHILD_DAG_NAME = 'subby'

def make_sub_dag(parent_dag_name, child_dag_name, start_date, schedule_interval):
    # The sub-DAG's dag_id must be '<parent_dag_id>.<subdag_task_id>'.
    dag = DAG(
        '%s.%s' % (parent_dag_name, child_dag_name),
        schedule_interval=schedule_interval,
        start_date=start_date
    )

    task_download = DummyOperator(
        task_id='task_download_csv',
        dag=dag
    )

    task_upload = DummyOperator(
        task_id='task_upload_csv',
        dag=dag
    )

    task_download >> task_upload

    return dag


main_dag = DAG(
    PARENT_DAG_NAME,
    schedule_interval=None,
    start_date=datetime(2017, 1, 1)
)

main_task_1 = DummyOperator(
    task_id='main_1',
    dag=main_dag
)

# The SequentialExecutor runs the sub-DAG's tasks one after another inside
# the SubDagOperator task itself, so they all execute on the same worker.
main_task_2 = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=make_sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, main_dag.start_date, main_dag.schedule_interval),
    executor=SequentialExecutor(),
    dag=main_dag
)

main_task_3 = DummyOperator(
    task_id='main_3',
    dag=main_dag
)

main_task_1 >> main_task_2 >> main_task_3


Source: https://stackoverflow.com/questions/45842564/guarantee-that-some-operators-will-be-executed-on-the-same-airflow-worker
