airflow

How to dynamically create subdags in Airflow

Submitted by 我是研究僧i on 2020-02-20 10:53:29
Question: I have a main DAG which retrieves a file and splits the data in this file into separate CSV files. I have another set of tasks that must be done for each of these CSV files, e.g. uploading to GCS, inserting into BigQuery. How can I generate a SubDag for each file dynamically, based on the number of files? The SubDag will define the tasks (uploading to GCS, inserting into BigQuery, deleting the CSV file). So right now, this is what it looks like: main_dag = DAG(....) download_operator =
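Below is a minimal sketch of one common approach, assuming the list of CSV files is known when the DAG file is parsed (the file names, callables, and IDs here are hypothetical, not from the question): a factory builds a SubDAG whose tasks process one file, and the main DAG attaches one SubDagOperator per file. Import paths are the Airflow 1.x ones current when this question was asked.

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.subdag_operator import SubDagOperator

default_args = {"start_date": datetime(2020, 1, 1)}

def build_subdag(parent_dag_id, child_id, filename, args):
    # The SubDAG's dag_id must be "<parent>.<child>" to match the operator's task_id,
    # and by convention its schedule matches the parent's.
    subdag = DAG(dag_id="%s.%s" % (parent_dag_id, child_id),
                 default_args=args, schedule_interval="@daily")
    upload = PythonOperator(task_id="upload_to_gcs",
                            python_callable=lambda: print("uploading", filename),
                            dag=subdag)
    insert = PythonOperator(task_id="insert_to_bq",
                            python_callable=lambda: print("inserting", filename),
                            dag=subdag)
    delete = PythonOperator(task_id="delete_csv",
                            python_callable=lambda: print("deleting", filename),
                            dag=subdag)
    upload >> insert >> delete
    return subdag

main_dag = DAG("main_dag", default_args=default_args, schedule_interval="@daily")

csv_files = ["a.csv", "b.csv", "c.csv"]  # hypothetical: must be known at parse time
for f in csv_files:
    child_id = "process_%s" % f.replace(".", "_")
    SubDagOperator(task_id=child_id,
                   subdag=build_subdag(main_dag.dag_id, child_id, f, default_args),
                   dag=main_dag)

The key constraint is that the file list has to be available when Airflow parses the DAG file; if the number of files is only known at run time, a parse-time loop like this cannot see it.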

How often is the dag definition file read during a single dag run (is the dag reevaluated / recalculated every time a task runs / fires)?

Submitted by ∥☆過路亽.° on 2020-02-06 08:05:08
Question: How often is a DAG definition file read during a single DAG run? I have a large DAG that takes a long time to build (~1-3 min). Looking at the logs of each task as the DAG is running, it appears that the DAG definition file is being executed for every task before it runs...

*** Reading local file: /home/airflow/airflow/logs/mydag/mytask/2020-01-30T04:51:34.621883+00:00/1.log
[2020-01-29 19:02:10,844] {taskinstance.py:655} INFO - Dependencies all met for <TaskInstance: mydag.mytask2020-01
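For context, the scheduler re-parses DAG files continuously, and each task instance runs in a worker process that imports the DAG file again before executing, so any code at module top level runs on every parse. A small sketch of the usual mitigation, keeping expensive work inside a task callable rather than at module scope (the slow lookup here is hypothetical):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def expensive_lookup():
    # Hypothetical slow call (API request, DB query). Placed inside a
    # callable, it runs only when the task executes, not on every parse.
    return ["row1", "row2"]

def process():
    data = expensive_lookup()
    print(len(data))

dag = DAG("mydag", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

mytask = PythonOperator(task_id="mytask", python_callable=process, dag=dag)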

Airflow: Can you put descriptions of the tasks so that they show up in dashboard?

Submitted by ▼魔方 西西 on 2020-02-04 18:42:15
Question: I can't seem to find a way to add descriptions to Airflow tasks so that they show up in the dashboard. I am reading their documentation but can't find it there either. Does anyone know if this is possible?

Answer 1: You can document both DAGs and tasks with either doc or doc_<json|yaml|md|rst> fields, depending on how you want it formatted. These will show up on the dashboard under "Graph View" for DAGs and "Task Details" for tasks. Example: """ # Foo Hello, these are DAG docs. """ ... dag =
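A short self-contained sketch of those fields, using doc_md for Markdown rendering (the DAG and task here are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("documented_dag", start_date=datetime(2020, 1, 1),
          schedule_interval="@daily")
# Shown on the dashboard under "Graph View" for the DAG
dag.doc_md = """
# Foo
Hello, these are DAG docs.
"""

hello = BashOperator(task_id="hello", bash_command="echo hello", dag=dag)
# Shown under "Task Details" for the task
hello.doc_md = "**Task docs:** prints a greeting."

The plain doc field is rendered as plain text, while doc_json, doc_yaml, doc_md, and doc_rst are rendered in the corresponding format.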

How to deploy a modified airflow dag from a different start time?

Submitted by 放肆的年华 on 2020-02-04 05:14:05
Question: Let's say the scheduler is stopped for 5 hours, and I had a DAG scheduled to run twice every hour. Now when I restart the scheduler, I do not want Airflow to backfill all the instances that were missed; instead, I want it to continue from the current hour.

Answer 1: To achieve this behavior, you can add the LatestOnlyOperator, which was just recently introduced to master, to the start of your DAG. It is not currently part of a released version, though (1.7.1.3 is the latest version as of the writing of
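A sketch of that approach with a hypothetical half-hourly DAG (the import path below is the one used after the operator landed in a release, Airflow 1.8; Airflow 2 moved it to airflow.operators.latest_only):

from datetime import datetime
from airflow import DAG
from airflow.operators.latest_only_operator import LatestOnlyOperator
from airflow.operators.bash_operator import BashOperator

dag = DAG("skip_missed_runs", start_date=datetime(2020, 1, 1),
          schedule_interval="*/30 * * * *")

# For any DAG run that is not the most recent scheduled one, downstream
# tasks are skipped, so a backlog of missed intervals does no real work.
latest_only = LatestOnlyOperator(task_id="latest_only", dag=dag)
work = BashOperator(task_id="do_work", bash_command="echo work", dag=dag)
latest_only >> work

On newer Airflow versions, setting catchup=False on the DAG is the more direct way to stop missed intervals from being scheduled at all.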

Airflow DAG Running Every Second Rather Than Every Minute

Submitted by 感情迁移 on 2020-02-03 12:59:32
Question: I'm trying to schedule my DAG to run every minute, but it seems to be running every second instead. Based on everything I've read, I should just need to include schedule_interval='*/1 * * * *', # ..every 1 minute in my DAG and that's it, but it's not working. Here is a simple example I set up to test it:

from airflow import DAG
from airflow.operators import SimpleHttpOperator, HttpSensor, EmailOperator, S3KeySensor
from datetime import datetime, timedelta
from airflow.operators.bash_operator
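For what it's worth, a frequent cause of this symptom is a start_date far in the past: the scheduler then rapid-fires catch-up runs for every missed interval, which looks like the DAG firing every second. A minimal sketch of an every-minute DAG that avoids that (IDs are placeholders; the catchup flag is available from Airflow 1.8):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2020, 2, 3),  # keep this recent
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

dag = DAG("every_minute",
          default_args=default_args,
          schedule_interval="*/1 * * * *",  # every 1 minute
          catchup=False)                    # don't backfill missed intervals

tick = BashOperator(task_id="tick", bash_command="date", dag=dag)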

How to integrate Airflow with Github for running scripts

Submitted by 核能气质少年 on 2020-02-03 08:19:25
Question: If we maintain our code/scripts in a GitHub repository, is there any way to copy these scripts from the GitHub repository and execute them on some other cluster (which could be Hadoop or Spark)? Does Airflow provide any operator to connect to GitHub for fetching such files? Maintaining scripts in GitHub would provide more flexibility, as every change in the code would be reflected and used directly from there. Any idea on this scenario would really help.

Answer 1: You can use GitPython as part of a
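A sketch of the GitPython route the answer points at: a PythonOperator clones the repository on first run and pulls on later runs, so downstream tasks always see the latest scripts (the repo URL and local path are hypothetical):

import os
from datetime import datetime

import git  # GitPython, installed separately (pip install GitPython)
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

REPO_URL = "https://github.com/example/scripts.git"  # hypothetical
LOCAL_PATH = "/tmp/scripts"

def fetch_scripts():
    if os.path.isdir(os.path.join(LOCAL_PATH, ".git")):
        git.Repo(LOCAL_PATH).remotes.origin.pull()   # update existing clone
    else:
        git.Repo.clone_from(REPO_URL, LOCAL_PATH)    # first-time clone

dag = DAG("run_github_scripts", start_date=datetime(2020, 1, 1),
          schedule_interval="@daily")

fetch = PythonOperator(task_id="fetch_scripts",
                       python_callable=fetch_scripts, dag=dag)
# A downstream task (e.g. a BashOperator or an SSH/Spark operator) would then
# execute the pulled scripts on the target cluster.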

Running airflow tasks/dags in parallel

Submitted by 拟墨画扇 on 2020-02-03 04:05:49
Question: I'm using Airflow to orchestrate some Python scripts. I have a "main" DAG from which several SubDAGs are run. My main DAG is supposed to run according to the following overview: I've managed to get this structure in my main DAG by using the following lines:

etl_internal_sub_dag1 >> etl_internal_sub_dag2 >> etl_internal_sub_dag3
etl_internal_sub_dag3 >> etl_adzuna_sub_dag
etl_internal_sub_dag3 >> etl_adwords_sub_dag
etl_internal_sub_dag3 >> etl_facebook_sub_dag
etl_internal_sub_dag3 >> etl
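A self-contained sketch of that structure, with DummyOperator stand-ins for the SubDAGs. After etl_internal_sub_dag3, the remaining tasks have no dependencies on each other, so Airflow can run them in parallel, provided the executor allows it (LocalExecutor or CeleryExecutor rather than the default SequentialExecutor):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("main_dag", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

etl_internal_sub_dag1 = DummyOperator(task_id="etl_internal_sub_dag1", dag=dag)
etl_internal_sub_dag2 = DummyOperator(task_id="etl_internal_sub_dag2", dag=dag)
etl_internal_sub_dag3 = DummyOperator(task_id="etl_internal_sub_dag3", dag=dag)
etl_adzuna_sub_dag = DummyOperator(task_id="etl_adzuna_sub_dag", dag=dag)
etl_adwords_sub_dag = DummyOperator(task_id="etl_adwords_sub_dag", dag=dag)
etl_facebook_sub_dag = DummyOperator(task_id="etl_facebook_sub_dag", dag=dag)

# Serial chain, then fan out; the bracketed tasks can run concurrently.
etl_internal_sub_dag1 >> etl_internal_sub_dag2 >> etl_internal_sub_dag3
etl_internal_sub_dag3 >> [etl_adzuna_sub_dag, etl_adwords_sub_dag,
                          etl_facebook_sub_dag]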