How can I control the parallelism or concurrency of an Airflow DAG?


Question


In some of my Airflow installations, DAGs or tasks that are scheduled to run do not run even when the scheduler is not fully loaded. How can I increase the number of DAGs or tasks that can run concurrently?

Similarly, if my installation is under high load and I want to limit how quickly my Airflow workers pull queued tasks, what can I adjust?


Answer 1:


Here's an expanded list of configuration options that are available in Airflow v1.10.2. Some can be set on a per-DAG or per-operator basis, and may fall back to the setup-wide defaults if unspecified.


Options that can be specified on a per-DAG basis:

  • concurrency: the number of task instances allowed to run concurrently across all active runs of the DAG this is set on. Defaults to core.dag_concurrency if not set
  • max_active_runs: maximum number of active runs for this DAG. The scheduler will not create new active DAG runs once this limit is hit. Defaults to core.max_active_runs_per_dag if not set

Examples:

from airflow import DAG

# Only allow one run of this DAG to be active at any given time
dag = DAG('my_dag_id', max_active_runs=1)

# Allow a maximum of 10 running tasks across a maximum of 2 active DAG runs
dag = DAG('example2', concurrency=10, max_active_runs=2)

Options that can be specified on a per-operator basis:

  • pool: the pool to execute the task in. Pools can be used to limit parallelism for only a subset of tasks
  • task_concurrency: the maximum number of concurrently running instances of this particular task, across all DAG runs

Example:

from airflow.models import BaseOperator

# pool and task_concurrency are accepted by every operator, since they are defined on BaseOperator
t1 = BaseOperator(task_id='my_task', pool='my_custom_pool', task_concurrency=12)
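
Note that pools are not created implicitly: a pool must already exist, with a slot count, before tasks can run in it. As a sketch using the Airflow 1.10 CLI (the slot count of 5 and the description are illustrative):

# Create a pool named my_custom_pool with 5 slots; tasks assigned to it
# will never occupy more than 5 slots at once
airflow pool -s my_custom_pool 5 "limit parallelism for a subset of tasks"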

Options that are specified across an entire Airflow setup:

  • core.parallelism: maximum number of tasks running across an entire Airflow installation
  • core.dag_concurrency: max number of tasks that can be running per DAG (across multiple DAG runs)
  • core.non_pooled_task_slot_count: number of task slots allocated to tasks not running in a pool
  • core.max_active_runs_per_dag: maximum number of active DAG runs, per DAG
  • scheduler.max_threads: how many threads the scheduler process should use to schedule DAGs
  • celery.worker_concurrency: number of task instances that a worker will take if using CeleryExecutor
  • celery.sync_parallelism: number of processes CeleryExecutor should use to sync task state
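
For reference, here is what these options look like in airflow.cfg. This is a sketch only; the values shown are the defaults that ship with Airflow 1.10, not tuning recommendations:

[core]
# Maximum number of task instances running across the entire installation
parallelism = 32
# Maximum number of running tasks per DAG, across its active runs
dag_concurrency = 16
# Task slots available to tasks that are not assigned to a pool
non_pooled_task_slot_count = 128
# Maximum number of active runs per DAG
max_active_runs_per_dag = 16

[scheduler]
# Threads the scheduler process uses to schedule DAGs
max_threads = 2

[celery]
# Task instances a Celery worker will take at once
worker_concurrency = 16
# Processes the CeleryExecutor uses to sync task state (0 = use all cores)
sync_parallelism = 0

Each of these can also be set through an environment variable of the form AIRFLOW__<SECTION>__<KEY>, e.g. AIRFLOW__CORE__PARALLELISM.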



Answer 2:


Check which core.executor is set in the Airflow configuration. SequentialExecutor runs tasks one at a time, so choose LocalExecutor or CeleryExecutor, which execute tasks in parallel. After that, you can use the other options mentioned by @hexacyanide.
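
For example, a minimal airflow.cfg sketch switching the executor (note that LocalExecutor and CeleryExecutor require a metadata database that supports concurrent access, such as MySQL or PostgreSQL, rather than the default SQLite):

[core]
# SequentialExecutor (the default) runs a single task at a time;
# LocalExecutor runs tasks in parallel processes on one machine;
# CeleryExecutor distributes tasks across a pool of workers.
executor = LocalExecutor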



Source: https://stackoverflow.com/questions/56370720/how-can-i-control-the-parallelism-or-concurrency-of-an-airflow-dag
