Create dynamic pool in Airflow

Submitted by 耗尽温柔 on 2019-12-13 00:24:55

Question


I have a DAG that creates a cluster, starts computation tasks, and, after they complete, tears down the cluster. I want to limit the concurrency of the computation tasks running on this cluster to a fixed number. So, logically, I need a pool that is exclusive to the cluster created by a task. I don't want interference from other DAGs or from different runs of the same DAG.

I thought I could solve this problem by creating a pool dynamically from a task after the cluster is created and deleting it once the computation tasks are finished. I thought I could template the pool parameter of the computation tasks to make them use this dynamically created pool.

# execute registers a pool and returns the pool name
create_pool = CreatePoolOperator(
    slots=4,
    task_id='create_pool',
    dag=self
)

# the pool parameter is templated
computation = ComputeOperator(
    task_id=compute_subtask_name,
    pool="{{ ti.xcom_pull(task_ids='create_pool') }}",
    dag=self
)

create_pool >> computation

But this way the computation tasks are never triggered, so I think the pool parameter is saved in the task instance before being templated. I would like to hear your thoughts on how to achieve the desired behavior.


Answer 1:


Instead of trying to get a dynamic pool to work, see if the concurrency attribute on airflow.models.DAG will do the trick. It limits the number of task instances allowed to run concurrently within a single run of that DAG.
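As a minimal sketch (the dag_id, dates, and dummy tasks are placeholders, not from the question), capping a DAG at 4 concurrent tasks looks like this; note that in Airflow 2.2+ this parameter was renamed max_active_tasks:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# At most 4 task instances of this DAG run at the same time within a
# single DAG run, no matter how many compute tasks are defined below.
dag = DAG(
    dag_id='cluster_computation',   # placeholder name
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    concurrency=4,                  # renamed to max_active_tasks in Airflow 2.2+
)

compute_tasks = [
    DummyOperator(task_id='compute_{}'.format(i), dag=dag)
    for i in range(10)
]
```

Unlike a pool, this cap applies to all tasks in the DAG, not just the computation subset, so it is only a good fit when the cluster tasks dominate the DAG.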




Answer 2:


This answer will probably aggravate some, but it's one possible path nonetheless, so it's worth documenting. The core feature that makes Airflow more powerful than its competitors is that everything is defined in code. At the end of the day, if Airflow does not provide us with a feature, we can always create it ourselves in Python.

You want the ability to pool tasks in a DAG, but only for that specific DAG run. So try to just create a custom pool around your tasks. Here's some pseudo code off the top of my head:

tasksPoolQueue = []

def taskOnesFunction():

    while True:

        if tasksPoolQueue and tasksPoolQueue[0] == "taskOnesTurn":
            print("Do some work, it's your turn")

            # Delete this run from the list and shift it left one index,
            # so that the next value is now the first value in the list
            tasksPoolQueue.pop(0)

            return 0

        else:
            time.sleep(10)

def taskTwosFunction():

    while True:

        if tasksPoolQueue and tasksPoolQueue[0] == "taskTwosTurn":
            print("Do some work, it's your turn")

            # Delete this run from the list and shift it left one index,
            # so that the next value is now the first value in the list
            tasksPoolQueue.pop(0)

            return 0

        else:
            time.sleep(10)

def createLogicalOrderingOfTaskPoolQueue():

    if foobar:
        tasksPoolQueue[:] = ["taskOnesTurn", "taskTwosTurn"]
    else:
        tasksPoolQueue[:] = ["taskTwosTurn", "taskOnesTurn"]

    return 0


determine_pool_queue_ordering = PythonOperator(
    task_id='determine_pool_queue_ordering',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=createLogicalOrderingOfTaskPoolQueue,
    op_args=[])

task1 = PythonOperator(
    task_id='task1',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=taskOnesFunction,
    op_args=[])

task2 = PythonOperator(
    task_id='task2',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=taskTwosFunction,
    op_args=[])

determine_pool_queue_ordering.set_downstream(task1)
determine_pool_queue_ordering.set_downstream(task2)

So hopefully everyone can follow my pseudo code. I don't know the best way to create a custom pool without introducing a race condition, so this list-queue idea is what I came up with at first glance. But the main point is that task1 and task2 both start at the same time, yet inside their functions nothing meaningful happens until they get past the if statement, which prevents the real code from running out of turn.

The first task dynamically sets which tasks run first, and in what order, using the list. Every function that needs to be in this custom pool then references that list. Since each if statement is only true when its task name is first in the list, essentially only one task can run at a time. The task at the front of the list deletes itself once it's done processing, and the other tasks sleep while they wait for their name to reach the front.

So just make some custom logic similar to mine.
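The waiting loop above can be sketched as runnable Python, using threads as a stand-in for Airflow tasks (an assumption for illustration: in real Airflow the tasks run in separate processes, so the shared list would need to live in an external store such as a Variable or database row rather than in memory):

```python
import threading
import time
from collections import deque

# Shared "pool queue": only the task whose name is at the front may do work.
tasks_pool_queue = deque(["task_one", "task_two"])
lock = threading.Lock()
results = []

def pooled_task(name):
    # Spin until this task's name reaches the front of the queue.
    while True:
        with lock:
            if tasks_pool_queue and tasks_pool_queue[0] == name:
                results.append(name)        # the "real work" happens here
                tasks_pool_queue.popleft()  # hand the turn to the next task
                return
        time.sleep(0.01)

# Start the threads in the opposite order to show the queue, not the
# start order, decides who runs first.
threads = [threading.Thread(target=pooled_task, args=(n,))
           for n in ("task_two", "task_one")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # ['task_one', 'task_two']
```

The lock around the check-and-pop is what keeps two tasks from seeing themselves at the front simultaneously; without an equivalent atomic operation in the external store, the race condition the answer mentions reappears.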




Answer 3:


Here is an operator that creates a pool if it doesn't exist.

from airflow.api.common.experimental.pool import get_pool, create_pool
from airflow.exceptions import PoolNotFound
from airflow.models import BaseOperator
from airflow.utils import apply_defaults


class CreatePoolOperator(BaseOperator):
    # its pool blue, get it?
    ui_color = '#b8e9ee'

    @apply_defaults
    def __init__(
            self,
            name,
            slots,
            description='',
            *args, **kwargs):
        super(CreatePoolOperator, self).__init__(*args, **kwargs)
        self.description = description
        self.slots = slots
        self.name = name

    def execute(self, context):
        try:
            pool = get_pool(name=self.name)
            if pool:
                self.log.info(f'Pool exists: {pool}')
                return
        except PoolNotFound:
            # create the pool
            pool = create_pool(name=self.name, slots=self.slots, description=self.description)
            self.log.info(f'Created pool: {pool}')

Deleting the pool could be done in a similar manner.
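A matching teardown operator might look like this; it's a sketch against the same experimental API, which also exposes delete_pool (the class name and the swallow-if-missing behavior are choices of this sketch, not from the answer):

```python
from airflow.api.common.experimental.pool import delete_pool
from airflow.exceptions import PoolNotFound
from airflow.models import BaseOperator
from airflow.utils import apply_defaults


class DeletePoolOperator(BaseOperator):
    ui_color = '#b8e9ee'

    @apply_defaults
    def __init__(self, name, *args, **kwargs):
        super(DeletePoolOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        try:
            pool = delete_pool(name=self.name)
            self.log.info(f'Deleted pool: {pool}')
        except PoolNotFound:
            # Nothing to clean up; treat as success so teardown is idempotent.
            self.log.info(f'Pool does not exist: {self.name}')
```

You would typically give this task trigger_rule='all_done' so the pool is removed even when a computation task upstream fails.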



Source: https://stackoverflow.com/questions/52426489/create-dynamic-pool-in-airflow
