Guarantee that some operators will be executed on the same Airflow worker

痞子三分冷 submitted on 2021-01-28 07:33:15


I have a DAG which

  1. downloads a csv file from cloud storage
  2. uploads the csv file to a 3rd party via https

The Airflow cluster I am running on uses CeleryExecutor by default, so I'm worried that once I scale up the number of workers, these tasks may be executed on different workers. For example, worker A does the download, then worker B tries to upload but doesn't find the file (because it's on worker A).

Is it possible to somehow guarantee that both the download and upload operators will be executed on the same airflow worker?


For these kinds of use cases we have two solutions:

  1. Use a network-mounted drive shared between the workers, so that both the download and upload tasks see the same file system.
  2. Use an Airflow queue that is worker-specific. If only one worker listens on this queue, both tasks are guaranteed to run on that worker and therefore see the same file system. Note that a worker can listen on multiple queues, so it can serve the "default" queue as well as the custom one intended for these tasks.
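
As a rough illustration of option 2, the sketch below pins both tasks to one queue through the operators' `queue` argument. The queue name `csv_worker`, the DAG id, and the start date are made up for the example, and DummyOperators stand in for the real download and upload.

```python
from datetime import datetime

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical queue name; exactly one worker should subscribe to it.
PINNED_QUEUE = 'csv_worker'

dag = DAG(
    'csv_transfer_pinned',            # hypothetical DAG id
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),  # placeholder date
)

# Both tasks name the same queue, so only workers listening on it pick them up.
task_download = DummyOperator(
    task_id='task_download_csv',
    queue=PINNED_QUEUE,
    dag=dag,
)

task_upload = DummyOperator(
    task_id='task_upload_csv',
    queue=PINNED_QUEUE,
    dag=dag,
)

task_download >> task_upload
```

The chosen worker then has to be started with that queue in its list, for instance `airflow worker -q csv_worker,default` on Airflow 1.10 or `airflow celery worker -q csv_worker,default` on Airflow 2. The guarantee only holds while a single worker subscribes to the queue.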


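Put step 1 (the csv download) and step 2 (the csv upload) into a subdag, and then trigger it via a SubDagOperator with the executor option set to a SequentialExecutor. The SubDagOperator itself runs as a single task on one worker, and the SequentialExecutor runs the subdag's tasks one after another inside that task, so steps 1 and 2 run on the same worker.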

Here is a working DAG file illustrating that concept (with the actual operations stubbed out as DummyOperators), with the download/upload steps in the context of some larger process:

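from datetime import datetime
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.executors.sequential_executor import SequentialExecutor

# Placeholder names; the values used in the original post were not preserved.
PARENT_DAG_NAME = 'parent_dag'
CHILD_DAG_NAME = 'csv_transfer'


def make_sub_dag(parent_dag_name, child_dag_name, start_date, schedule_interval):
    # A subdag's dag_id must be '<parent dag_id>.<SubDagOperator task_id>'.
    dag = DAG(
        '%s.%s' % (parent_dag_name, child_dag_name),
        schedule_interval=schedule_interval,
        start_date=start_date,
    )

    task_download = DummyOperator(
        task_id='task_download_csv',
        dag=dag,
    )

    task_upload = DummyOperator(
        task_id='task_upload_csv',
        dag=dag,
    )

    task_download >> task_upload

    return dag


main_dag = DAG(
    PARENT_DAG_NAME,
    schedule_interval=None,           # placeholder schedule
    start_date=datetime(2021, 1, 1),  # placeholder date
)

main_task_1 = DummyOperator(
    task_id='main_1',
    dag=main_dag,
)

# The subdag's tasks run sequentially inside this single task.
main_task_2 = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=make_sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, main_dag.start_date, main_dag.schedule_interval),
    executor=SequentialExecutor(),
    dag=main_dag,
)

main_task_3 = DummyOperator(
    task_id='main_3',
    dag=main_dag,
)

main_task_1 >> main_task_2 >> main_task_3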
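
The DummyOperator stubs above do not show how the file is handed from one step to the next. Here is a minimal sketch of that handoff, using only the standard library. The helper names (`local_csv_path`, `download_csv`, `upload_csv`) are hypothetical, and the cloud-storage and HTTPS calls are replaced by plain file writes and reads. In a real DAG, functions like these could be passed as the `python_callable` of two PythonOperators.

```python
import csv
import os
import tempfile


def local_csv_path(dag_id, run_id, base_dir=None):
    """Build a deterministic local path so both tasks agree on the file location."""
    base_dir = base_dir or tempfile.gettempdir()
    # Run ids often contain characters that are awkward in file names.
    safe_run_id = run_id.replace(':', '_').replace('/', '_')
    return os.path.join(base_dir, dag_id, safe_run_id, 'data.csv')


def download_csv(path, rows):
    """Stand-in for the cloud-storage download: writes the given rows to path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w', newline='') as handle:
        csv.writer(handle).writerows(rows)
    return path


def upload_csv(path):
    """Stand-in for the HTTPS upload: reads back the file the download wrote."""
    if not os.path.exists(path):
        # This is the failure described in the question.
        raise FileNotFoundError(
            '%s is missing; did the download run on another worker?' % path
        )
    with open(path, newline='') as handle:
        return list(csv.reader(handle))
```

Because the path depends only on the DAG id and run id, the upload step finds the file whenever it runs on the same machine as the download step (or on a shared mount), and fails loudly otherwise.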

