ETL model with DAGs and Tasks

Submitted by 帅比萌擦擦* on 2019-12-02 18:15:19

Question


I'm trying to model my ETL jobs with Airflow. All jobs have roughly the same structure:

  1. Extract from a transactional database (N extractions, each one reading 1/N of the table)
  2. Transform the data
  3. Insert the data into an analytic database

So each job is E >> T >> L
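The "N extractions, each one reading 1/N of the table" step can be sketched in plain Python. This is only an illustration under assumptions: the helper name `split_into_ranges` is hypothetical, and it assumes the table can be addressed by a contiguous row offset, which the question does not state.

```python
from typing import List, Tuple


def split_into_ranges(total_rows: int, n_parts: int) -> List[Tuple[int, int]]:
    """Return n_parts half-open (start, end) row ranges that cover 0..total_rows."""
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    base, extra = divmod(total_rows, n_parts)
    ranges = []
    start = 0
    for i in range(n_parts):
        # The first `extra` parts take one additional row so no row is dropped.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Hypothetical example: a 10-row table read by 3 parallel extractions.
ranges = split_into_ranges(total_rows=10, n_parts=3)
print(ranges)  # [(0, 4), (4, 7), (7, 10)]
```

Each range could then be handed to its own extraction task, so the N extractions run in parallel before the transform step.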

The Company Routine (USER >> PRODUCT >> ORDER) has to run every 2 hours. After each run I will have all the data on users and purchases.

How can I model it?

  • Should the Company Routine (USER >> PRODUCT >> ORDER) be a DAG, with each job as a separate task? In that case, how can I model each step (E, T, L) inside the task and make the steps behave like "sub-tasks" in Airflow?
  • Or should each job be a separate DAG? In that case, how can I say that the Company Routine (USER >> PRODUCT >> ORDER) has to run every 2 hours and that the jobs have these dependencies? As far as I can see, the cron schedule and the dependencies can only be set between tasks inside a DAG.
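As a library-free sketch (this is not Airflow code), the first option amounts to one flat dependency graph in which every E, T and L step is its own node. The task names such as `USER.extract` and the helper `build_routine_graph` are hypothetical, and the sketch assumes a single extraction per job for brevity. In Airflow, each node would correspond to a task and each edge to a `>>` dependency.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

JOBS = ["USER", "PRODUCT", "ORDER"]
STEPS = ["extract", "transform", "load"]


def build_routine_graph(jobs, steps):
    """Map each task name to the set of tasks it must wait for."""
    graph = {}
    previous = None
    for job in jobs:
        for step in steps:
            task = f"{job}.{step}"
            # Each task waits for the one before it; the first task waits for nothing.
            graph[task] = {previous} if previous else set()
            previous = task
    return graph


graph = build_routine_graph(JOBS, STEPS)
run_order = list(TopologicalSorter(graph).static_order())
print(run_order)
```

Because the graph is a single chain, the run order is USER's three steps, then PRODUCT's, then ORDER's. The second option would cut this chain into three smaller graphs, one per job, that then have to be linked to each other.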

Diagram:

Currently I model each Company Routine (USER >> PRODUCT >> ORDER) as a DAG, with each job as a separate task.


Answer 1:


The 2nd option is better (make each sub-workflow of the Company Routine a top-level DAG) because

  • top-level DAGs can be rerun independently (in case only one of them needs to be rerun), whereas you cannot rerun just one part of a DAG if you modelled everything as a monolithic DAG
  • the same holds true for backfilling

But then you must also link those top-level DAGs together so that they run one after another. For that, see Wiring top-level DAGs together



Source: https://stackoverflow.com/questions/57077895/etl-model-with-dags-and-tasks
