Question
I have a DAG that creates a cluster, starts computation tasks, and tears the cluster down after they complete. I want to limit the concurrency of the computation tasks carried out on this cluster to a fixed number. So, logically, I need a pool that is exclusive to the cluster created by a task. I don't want interference with other DAGs or with different runs of the same DAG.
I thought I could solve this problem by creating a pool dynamically from a task after the cluster is created, and deleting it once the computation tasks are finished. I thought I could template the pool
parameter of the computation tasks to make them use this dynamically created pool.
# execute registers a pool and returns the pool name
create_pool = CreatePoolOperator(
    slots=4,
    task_id='create_pool',
    dag=self
)

# the pool parameter is templated
computation = ComputeOperator(
    task_id=compute_subtask_name,
    pool="{{ ti.xcom_pull(task_ids='create_pool') }}",
    dag=self
)

create_pool >> computation
But this way, the computation tasks are never triggered. I suspect the pool parameter is saved on the task instance before it is templated. I would like to hear your thoughts on how to achieve the desired behavior.
Answer 1:
Instead of trying to get a dynamic pool to work, see if the concurrency
attribute on airflow.models.DAG
will do the trick. It limits the number of task instances that are allowed to run concurrently within the DAG.
Answer 2:
This answer will probably aggravate some, but it's one possible path nonetheless and so it's worth documenting. The core feature that makes Airflow more powerful than its competitors is that everything is defined in code. At the end of the day, if Airflow doesn't provide us with a feature, we can always create it ourselves using Python.
You want the ability to pool tasks in a DAG, but only for that specific DAG run. So try to just create a custom pool on your tasks. Here's some pseudo code off the top of my head:
import time

# Shared turn queue. NOTE: this is pseudo code -- in a real deployment the
# queue would have to live in shared storage (e.g. a database or XCom),
# since Airflow tasks run in separate processes and do not share memory.
tasksPoolQueue = []

def taskOnesFunction(**kwargs):
    while True:
        if tasksPoolQueue[0] == "taskOnesTurn":
            print("Do some work, it's your turn")
            # Remove this entry so the next value becomes first in the list
            tasksPoolQueue.pop(0)
            return 0
        else:
            time.sleep(10)

def taskTwosFunction(**kwargs):
    while True:
        if tasksPoolQueue[0] == "taskTwosTurn":
            print("Do some work, it's your turn")
            # Remove this entry so the next value becomes first in the list
            tasksPoolQueue.pop(0)
            return 0
        else:
            time.sleep(10)

def createLogicalOrderingOfTaskPoolQueue(**kwargs):
    # "foobar" stands in for whatever condition decides the ordering
    if foobar:
        tasksPoolQueue.extend(["taskOnesTurn", "taskTwosTurn"])
    else:
        tasksPoolQueue.extend(["taskTwosTurn", "taskOnesTurn"])
    return 0
determine_pool_queue_ordering = PythonOperator(
    task_id='determine_pool_queue_ordering',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=createLogicalOrderingOfTaskPoolQueue,
    op_args=[])

task1 = PythonOperator(
    task_id='task1',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=taskOnesFunction,
    op_args=[])

task2 = PythonOperator(
    task_id='task2',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=taskTwosFunction,
    op_args=[])

determine_pool_queue_ordering.set_downstream(task1)
determine_pool_queue_ordering.set_downstream(task2)
So hopefully everyone can follow my pseudo code. I don't know the best way to create a custom pool without introducing a race condition, so this list-based queue is what I came up with at first glance. But the main point is that task1 and task2 will run at the same time, yet inside their functions nothing meaningful happens until each one gets past the if statement guarding the real work.
The first task dynamically sets which tasks run first, and in what order, using the list. Then every function that needs to be in this custom pool references that list. Since each if statement is only true when its task name is first in the list, effectively only one task can run at a time. The task at the head of the list deletes itself once it has finished its work; the other tasks sleep until their name comes first.
So just make some custom logic similar to mine.
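The turn-taking idea above can be illustrated with plain Python threads. This is a stand-alone sketch, not Airflow code (the names are invented, and as noted above a real DAG would need the queue in shared storage since tasks don't share memory), but it shows how busy-waiting on the head of a queue serializes concurrent workers:

```python
import threading
import time

# Workers run concurrently, but each busy-waits until its name reaches the
# front of the shared queue, so the "real work" happens one worker at a time.
tasks_pool_queue = ['task_one', 'task_two', 'task_three']
queue_lock = threading.Lock()
completed = []

def pooled_task(name):
    while True:
        with queue_lock:
            if tasks_pool_queue and tasks_pool_queue[0] == name:
                completed.append(name)   # the "real work"
                tasks_pool_queue.pop(0)  # hand the turn to the next task
                return
        time.sleep(0.01)  # not our turn yet

# Start the workers in a different order than the queue dictates
threads = [threading.Thread(target=pooled_task, args=(n,))
           for n in ('task_three', 'task_one', 'task_two')]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(completed)  # work happens in queue order: ['task_one', 'task_two', 'task_three']
```

Even though the threads start in a shuffled order, the queue forces the work itself to happen strictly in the order the first task laid down.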
Answer 3:
Here is an operator that creates a pool if it doesn't exist.
from airflow.api.common.experimental.pool import get_pool, create_pool
from airflow.exceptions import PoolNotFound
from airflow.models import BaseOperator
from airflow.utils import apply_defaults

class CreatePoolOperator(BaseOperator):
    # it's pool blue, get it?
    ui_color = '#b8e9ee'

    @apply_defaults
    def __init__(
            self,
            name,
            slots,
            description='',
            *args, **kwargs):
        super(CreatePoolOperator, self).__init__(*args, **kwargs)
        self.description = description
        self.slots = slots
        self.name = name

    def execute(self, context):
        try:
            pool = get_pool(name=self.name)
            if pool:
                self.log.info(f'Pool exists: {pool}')
                return
        except PoolNotFound:
            # the pool does not exist yet, so create it
            pool = create_pool(name=self.name, slots=self.slots, description=self.description)
            self.log.info(f'Created pool: {pool}')
Deleting the pool could be done in a similar manner.
Source: https://stackoverflow.com/questions/52426489/create-dynamic-pool-in-airflow