airflow

Airflow Scheduler Misunderstanding

牧云@^-^@ submitted on 2019-12-02 09:18:26
I'm new to Airflow. My goal is to run a DAG on a daily basis, starting 1 hour from now. I'm truly misunderstanding Airflow's "end-of-interval invoke" scheduling rules. From the docs [(Airflow Docs)][1]: "Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended." I set schedule_interval as follows: schedule_interval="00 15 * * *" and start_date as follows: start_date=datetime(year=2019, month=8, day=7). My assumption was that if now it's 14…
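
For illustration, a minimal sketch of those end-of-interval semantics (the DAG id and task below are hypothetical; the start_date and cron expression are the ones from the question):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# With this configuration the first run carries execution_date 2019-08-07 15:00,
# but it is only started shortly after 2019-08-08 15:00, i.e. once the interval
# it covers (08-07 15:00 -> 08-08 15:00) has ended.
dag = DAG(
    dag_id="daily_example",            # hypothetical name
    start_date=datetime(year=2019, month=8, day=7),
    schedule_interval="00 15 * * *",   # every day at 15:00
    catchup=False,                     # do not backfill past intervals
)

run_task = DummyOperator(task_id="run_task", dag=dag)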

Airflow mysql to gcp Dag error

我们两清 submitted on 2019-12-02 08:54:05
I recently started working with Airflow. I'm working on a DAG that: queries the MySQL database, extracts the query result and stores it in a Cloud Storage bucket as a JSON file, then uploads the stored JSON file to BigQuery. The DAG imports three operators: MySqlOperator , MySqlToGoogleCloudStorageOperator and GoogleCloudStorageToBigQueryOperator . I am using Airflow 1.8.0, Python 3, and Pandas 0.19.0. Here is my DAG code: sql2gcp_csv = MySqlToGoogleCloudStorageOperator( task_id='sql2gcp_csv', sql='airflow_gcp/aws_sql_extract_7days.sql', bucket='gs://{{var.value.gcs_bucket}}/{{ ds_nodash }}/', filename='{{ ds_nodash }}…
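
A sketch of how these two contrib operators are typically wired together on Airflow 1.8; the connection ids, file pattern and destination table below are assumptions rather than the asker's values, and note that bucket expects the bare bucket name (no gs:// prefix), with path components going into filename / source_objects:

from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

# Export the MySQL query result to GCS as newline-delimited JSON.
sql2gcp_json = MySqlToGoogleCloudStorageOperator(
    task_id='sql2gcp_json',
    sql='airflow_gcp/aws_sql_extract_7days.sql',
    bucket='{{ var.value.gcs_bucket }}',              # bare bucket name, no gs://
    filename='{{ ds_nodash }}/export_{}.json',        # {} is replaced with a file number
    mysql_conn_id='mysql_default',                    # hypothetical connection ids
    google_cloud_storage_conn_id='google_cloud_default',
    dag=dag,                                          # assumes a dag object exists
)

# Load the exported files into BigQuery.
gcs2bq = GoogleCloudStorageToBigQueryOperator(
    task_id='gcs2bq',
    bucket='{{ var.value.gcs_bucket }}',
    source_objects=['{{ ds_nodash }}/export_*.json'],
    destination_project_dataset_table='my_project.my_dataset.my_table',  # hypothetical
    source_format='NEWLINE_DELIMITED_JSON',
    write_disposition='WRITE_TRUNCATE',
    dag=dag,
)

sql2gcp_json >> gcs2bq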

ETL model with DAGs and Tasks

拜拜、爱过 submitted on 2019-12-02 07:55:18
I'm trying to model my ETL jobs with Airflow. All jobs have roughly the same structure: extract from a transactional database (N extractions, each one reading 1/N of the table), then transform the data, and finally insert the data into an analytic database. So E >> T >> L. This company routine, USER >> PRODUCT >> ORDER, has to run every 2 hours; then I will have all the data from users and purchases. How can I model it? Must the company routine (USER >> PRODUCT >> ORDER) be a DAG, with each job a separate task? In that case, how can I model each step (E, T, L) inside the task and make them behave like…
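
One common way to model this (a sketch with illustrative task ids, callables and cron, not the asker's code) is to make the whole routine a single DAG and give each entity its own E >> T >> L chain, chaining the entities in the required order:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract(entity):      # placeholder callables
    pass

def transform(entity):
    pass

def load(entity):
    pass

dag = DAG(
    dag_id="company_routine",           # hypothetical name
    start_date=datetime(2019, 8, 1),
    schedule_interval="0 */2 * * *",    # every 2 hours
)

previous = None
for entity in ["user", "product", "order"]:
    e = PythonOperator(task_id="extract_{}".format(entity),
                       python_callable=extract, op_args=[entity], dag=dag)
    t = PythonOperator(task_id="transform_{}".format(entity),
                       python_callable=transform, op_args=[entity], dag=dag)
    l = PythonOperator(task_id="load_{}".format(entity),
                       python_callable=load, op_args=[entity], dag=dag)
    e >> t >> l
    if previous is not None:
        previous >> e                   # enforces USER >> PRODUCT >> ORDER
    previous = l

The N parallel extractions of a single table could be generated in a similar loop, all feeding into that entity's transform task.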

Apache Airflow: Delay a task for some period of time

雨燕双飞 submitted on 2019-12-02 06:57:44
Question: I am trying to execute a task 5 minutes after its parent task inside a DAG. DAG: Task 1 ----> Wait for 5 minutes ----> Task 2. How can I achieve this in Apache Airflow? Thanks in advance. Answer 1: You can add a TimeDeltaSensor with a timedelta of 5 minutes between Task 1 and Task 2. Answer 2: The said behaviour can be achieved by introducing a task that forces a delay of the specified duration between your Task 1 and Task 2. This can be achieved using a PythonOperator: import time from airflow.operators…
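
A sketch of both suggested approaches; dag, task_1 and task_2 are assumed to be defined elsewhere, and note that the TimeDeltaSensor waits relative to the schedule of the data interval rather than to the moment Task 1 finishes:

from datetime import timedelta
import time

from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import TimeDeltaSensor

# Answer 1: sensor that waits until 5 minutes past the scheduled interval end.
wait_5_min = TimeDeltaSensor(
    task_id="wait_5_min",
    delta=timedelta(minutes=5),
    dag=dag,
)

# Answer 2: a task that simply sleeps for 5 minutes before letting Task 2 start.
sleep_5_min = PythonOperator(
    task_id="sleep_5_min",
    python_callable=lambda: time.sleep(300),
    dag=dag,
)

task_1 >> wait_5_min >> task_2      # or: task_1 >> sleep_5_min >> task_2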

HowTo run parallel Spark job using Airflow

青春壹個敷衍的年華 submitted on 2019-12-02 03:41:46
We have existing code in production that runs Spark jobs in parallel. We tried to orchestrate some mundane Spark jobs using Airflow and had success, but now we are not sure how to proceed with Spark jobs in parallel. Can the CeleryExecutor help in this case? Or should we modify our existing Spark job not to run in parallel? I personally don't like the latter approach. Our existing shell script that runs the Spark jobs in parallel is something like this, and we would like to run this shell script from Airflow: cat outfile.txt | parallel -k -j2 submitspark {} /data/list Please suggest.
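
One way to keep the parallelism but move it into Airflow (a sketch: the submitspark wrapper and /data/list come from the question, while the DAG id, input names and concurrency value are assumptions) is to generate one task per input and let the executor run them concurrently:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="parallel_spark_jobs",       # hypothetical name
    start_date=datetime(2019, 8, 1),
    schedule_interval=None,             # triggered manually
    concurrency=2,                      # mirrors `parallel -j2`
)

# One task per input line; with the LocalExecutor or CeleryExecutor these
# tasks run at the same time, up to the concurrency limit above.
inputs = ["input_a", "input_b", "input_c"]    # placeholder for the lines of outfile.txt
for name in inputs:
    BashOperator(
        task_id="submitspark_{}".format(name),
        bash_command="submitspark {} /data/list".format(name),
        dag=dag,
    )

The CeleryExecutor (or LocalExecutor) is what actually lets these tasks run in parallel; the default SequentialExecutor would serialize them.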

Apache Airflow - get all parent task_ids

痞子三分冷 submitted on 2019-12-02 03:35:52
Suppose the following situation: [c1, c2, c3] >> child_task where c1 , c2 , c3 and child_task are all operators with task_id equal to id1 , id2 , id3 and child_id respectively. Task child_task is also a PythonOperator with provide_context=True and python_callable=dummy_func def dummy_func(**context): #... Is it possible to retrieve all the parents' ids inside dummy_func (perhaps by browsing the DAG somehow using the context)? The expected result in this case would be the list ['id1', 'id2', 'id3'] . The upstream_task_ids and downstream_task_ids properties of BaseOperator are meant just for this…
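
A sketch of the approach hinted at in the answer, reading the operator's upstream_task_ids from the template context (dag and the c1/c2/c3 tasks are assumed to be defined elsewhere):

from airflow.operators.python_operator import PythonOperator

def dummy_func(**context):
    # `task` in the context is the running operator itself, so its
    # upstream_task_ids property lists the direct parents, e.g. ['id1', 'id2', 'id3'].
    parent_ids = list(context['task'].upstream_task_ids)
    print(parent_ids)

child_task = PythonOperator(
    task_id='child_id',
    python_callable=dummy_func,
    provide_context=True,     # required on Airflow 1.x for **context
    dag=dag,
)

[c1, c2, c3] >> child_task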

MssqlHook airflow connection

。_饼干妹妹 submitted on 2019-12-02 00:23:45
I am new to using Airflow and what I need to do is to use MssqlHook, but I do not know how. What elements should I give in the constructor? I have a connection in Airflow with the name connection_test. I do not fully understand the attributes in the class: class MsSqlHook(DbApiHook): """ Interact with Microsoft SQL Server. """ conn_name_attr = 'mssql_conn_id' default_conn_name = 'mssql_default' supports_autocommit = True I have the following code: sqlhook=MsSqlHook(connection_test) sqlhook.get_conn() And when I do this the error is Connection failed for unknown reason . What should I do in order to…
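
A sketch of the usual way to construct the hook: the Airflow connection name is passed as a string via the keyword named by conn_name_attr, i.e. mssql_conn_id; the query and table below are hypothetical:

from airflow.hooks.mssql_hook import MsSqlHook

# Pass the connection *name* as a string; MsSqlHook(connection_test) fails
# because connection_test is not a defined Python variable.
hook = MsSqlHook(mssql_conn_id='connection_test')

# get_records / get_pandas_df / run are inherited from DbApiHook.
rows = hook.get_records("SELECT TOP 5 * FROM some_table")   # hypothetical query
print(rows)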

In airflow, is there a good way to call another dag's task?

喜夏-厌秋 submitted on 2019-12-01 22:24:00
I've got dag_prime and dag_tertiary. dag_prime : scans through a directory and intends to call dag_tertiary on each entry; currently a PythonOperator. dag_tertiary : scans through the directory passed to it and does (possibly time-intensive) calculations on its contents. I can call the tertiary DAG via a system call from the Python operator, but I feel like there's got to be a better way. I'd also like to consider queuing the dag_tertiary calls, if there's a simple way to do that. Is there a better way than using system calls? Thanks! Answer: Use TriggerDagRunOperator for calling one DAG…
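
A sketch of the Airflow 1.x TriggerDagRunOperator usage suggested in the answer; the payload key, directory path and task id are assumptions:

from airflow.operators.dagrun_operator import TriggerDagRunOperator

# In Airflow 1.x the callable receives (context, dag_run_obj), may attach a
# payload (e.g. the directory for dag_tertiary to process), and must return
# the dag_run_obj for the trigger to fire.
def trigger_with_dir(context, dag_run_obj):
    dag_run_obj.payload = {'directory': '/data/incoming'}   # hypothetical path
    return dag_run_obj

trigger_tertiary = TriggerDagRunOperator(
    task_id='trigger_tertiary',
    trigger_dag_id='dag_tertiary',
    python_callable=trigger_with_dir,
    dag=dag_prime,            # assumes dag_prime is the DAG object
)

Runs created this way are queued and scheduled like any other DAG run, and dag_tertiary's max_active_runs setting can throttle how many of them execute at once.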