Airflow: run tasks at different times in the same DAG?

Submitted by 一曲冷凌霜 on 2019-12-01 11:23:33

I can think of 3 possible solutions to your woes (will add more alternatives when they come to mind)

  1. Set start_date on individual tasks within the DAG (apart from the start_date of the DAG itself), as described here. However, I would never favour this approach because it would be like a step back to the same time-based crons that Airflow tries to replace.

  2. Use pools to segregate tasks by runtime / priority. Here's an idea (you might need to rework it as per your requirements): put all tiny tasks in tiny_task_pool and all big ones in big_task_pool. Give tiny_task_pool a significantly higher number of slots than big_task_pool. That would make starvation of your tiny tasks much less likely. You can get creative with even more levels of pools.

  3. Even if your tasks have no real dependencies between them, it shouldn't hurt much to deliberately introduce some so that all (or most) big tasks are made downstream of tiny ones (and hence change the structure of your DAG). That amounts to a shortest-job-first kind of approach. You can also explore priority_weight / weight_rule to gain even more fine-grained control.
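For illustration, here's a minimal DAG-definition sketch of the pool idea above (all names, slot counts, and commands are hypothetical; the pools must already exist, created via the Airflow UI or the `airflow pool` CLI; imports follow Airflow 1.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="mixed_runtime_dag",            # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

for i in range(10):
    BashOperator(
        task_id="tiny_task_{}".format(i),
        bash_command="echo quick work",
        pool="tiny_task_pool",             # many slots -> tiny tasks rarely starve
        priority_weight=10,                # scheduled ahead of lower-weight tasks
        dag=dag,
    )

for i in range(3):
    BashOperator(
        task_id="big_task_{}".format(i),
        bash_command="sleep 3600",
        pool="big_task_pool",              # few slots -> big tasks can't hog all workers
        priority_weight=1,
        dag=dag,
    )
```

With, say, 32 slots on tiny_task_pool and 4 on big_task_pool, at most 4 big tasks ever run concurrently, leaving capacity free for the small ones.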

All the above alternatives assume that task lengths (durations of execution) are known ahead of time. In the real world, that might not be true; or even if it is, it might gradually change over time. For that, I'd suggest you tweak your dag-definition script to factor in the average (or median) runtime of your tasks over the last 'n' runs to decide their priority.

  • For the start_date method, just supply a later start_date (actually the same date, a later time) to tasks that ran longer in previous runs
  • For the pools method, move tasks between pools based on their previous run durations
  • For the task-dependency method, make longer-running tasks downstream. This might sound difficult, but you can visualize it like this: create 3 DummyOperators and link them up (one after another). Now fill in all the small tasks between the first 2 DummyOperators and the big ones between the next two.
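As a sketch of that runtime-driven tweak for the pools method, a small helper can bucket each task into a pool from the median of its recent run durations. The pool names, threshold, and function name are assumptions; collecting the duration history (e.g. from Airflow's metadata DB) is left to you:

```python
from statistics import median

# Illustrative threshold (seconds) separating "tiny" from "big" tasks.
BIG_TASK_THRESHOLD = 300.0

def choose_pool(recent_durations, threshold=BIG_TASK_THRESHOLD):
    """Pick a pool based on the median of a task's last n run durations.

    recent_durations: list of durations in seconds for the last n runs.
    Tasks with no history default to the big pool, so an unknown task
    can never starve the tiny ones.
    """
    if not recent_durations:
        return "big_task_pool"
    if median(recent_durations) < threshold:
        return "tiny_task_pool"
    return "big_task_pool"
```

At DAG-definition time you would then pass something like `pool=choose_pool(history[task_id])` when constructing each operator.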
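And the DummyOperator "fence" from the last bullet could look roughly like this (a sketch with made-up task ids and commands; imports follow Airflow 1.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("fenced_dag", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

start = DummyOperator(task_id="start", dag=dag)
fence = DummyOperator(task_id="small_tasks_done", dag=dag)
end = DummyOperator(task_id="end", dag=dag)

# All small tasks sit between `start` and the fence...
for i in range(5):
    t = BashOperator(task_id="small_{}".format(i), bash_command="echo fast", dag=dag)
    start >> t >> fence

# ...and all big tasks between the fence and `end`, so they only start
# once every small task has finished (a shortest-job-first-ish ordering).
for i in range(2):
    t = BashOperator(task_id="big_{}".format(i), bash_command="sleep 3600", dag=dag)
    fence >> t >> end
```

Reassigning a task from the first group to the second (or back) based on its past durations changes only which loop it is created in.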

This is likely because you have fewer execution slots than you have slow jobs. The scheduler doesn't particularly care what order it's running the tasks in, because you've said you don't care either.

If it really matters to you, these should probably be broken up into different dags, or you should declare dependencies so that the cheaper tasks finish first. There are any number of ways to express what you want; you just have to figure out what that is.
