airflow

Apache Airflow - get all parent task_ids

与世无争的帅哥 submitted on 2019-12-24 00:15:23
Question: Suppose the following situation: [c1, c2, c3] >> child_task, where c1, c2, c3 and child_task are all operators with task_id equal to id1, id2, id3 and child_id respectively. child_task is also a PythonOperator with provide_context=True and python_callable=dummy_func: def dummy_func(**context): #... Is it possible to retrieve all the parents' ids inside dummy_func (perhaps by browsing the DAG somehow using the context)? The expected result in this case would be the list ['id1', 'id2', 'id3'].
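A minimal sketch of one way to do this (the callable below is illustrative, not the asker's code): inside a callable run with provide_context=True, the context exposes the current operator as context['task'], and its upstream_task_ids property lists the direct parents.

def dummy_func(**context):
    task = context['task']                      # the operator currently executing
    parent_ids = list(task.upstream_task_ids)   # direct parents, e.g. ['id1', 'id2', 'id3'] (order not guaranteed)
    print(parent_ids)
    return parent_ids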

Airflow Get Retry Number

亡梦爱人 submitted on 2019-12-23 18:38:05
Question: In my Airflow DAG I have a task that needs to know whether this is its first run or a retry. I need to adjust the task's logic if it is a retry attempt. I have a few ideas on how I could store the number of retries for the task, but I'm not sure whether any of them are legitimate, or whether there's an easier, built-in way to get this information within the task. I'm wondering if I can just have an integer variable inside the DAG that I increment every time the task runs. Then if the task
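A hedged sketch of the built-in route (the task wiring is assumed): the TaskInstance available in the context carries try_number, so no hand-rolled counter is needed. The exact value seen inside a running task has shifted by one between some Airflow releases, so it is worth printing it once in your environment before relying on it.

def my_task(**context):
    ti = context['ti']                 # current TaskInstance
    if ti.try_number > 1:              # typically 1 on the first attempt, 2+ on retries
        print('retry attempt number: %s' % (ti.try_number - 1))
    else:
        print('first attempt')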

Airflow Python operator passing parameters

孤者浪人 submitted on 2019-12-23 17:27:45
Question: I'm trying to write a Python operator in an Airflow DAG and pass certain parameters to the Python callable. My code looks like the following: def my_sleeping_function(threshold): print(threshold) fmfdependency = PythonOperator( task_id='poke_check', python_callable=my_sleeping_function, provide_context=True, op_kwargs={'threshold': 100}, dag=dag) end = BatchEndOperator( queue=QUEUE, dag=dag) start.set_downstream(fmfdependency) fmfdependency.set_downstream(end) But I keep getting the error below.
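The traceback is cut off above, but a common failure with this exact setup is a TypeError about unexpected keyword arguments: provide_context=True makes Airflow pass the whole template context to the callable as keyword arguments, so the callable must accept them in addition to op_kwargs. A hedged sketch of that fix, reusing the names from the snippet:

def my_sleeping_function(threshold, **context):
    # 'threshold' arrives via op_kwargs; the Airflow context lands in **context
    print(threshold)

fmfdependency = PythonOperator(
    task_id='poke_check',
    python_callable=my_sleeping_function,
    provide_context=True,
    op_kwargs={'threshold': 100},
    dag=dag)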

Scheduling Airflow DAG job

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-23 17:14:15
Question: I have written an Airflow DAG as below - default_args = { 'owner': 'airflow', 'depends_on_past': False, 'start_date': datetime(2016, 7, 5), 'email': ['airflow@airflow.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1, 'retry_delay': timedelta(seconds=30), # 'queue': 'bash_queue', # 'pool': 'backfill', # 'priority_weight': 10, # 'end_date': datetime(2016, 1, 1), } dag = DAG( 'test-air', default_args=default_args, schedule_interval='*/2 * * * *') ................. .........
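For reference, a compact runnable version of the same schedule (the task body is an assumption, not part of the question): with schedule_interval='*/2 * * * *' the scheduler creates a run every two minutes once the start_date has passed, and each run is launched at the end of the interval it covers.

from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 7, 5),
    'retries': 1,
    'retry_delay': timedelta(seconds=30),
}

dag = DAG('test-air', default_args=default_args, schedule_interval='*/2 * * * *')

t1 = BashOperator(task_id='say_hello', bash_command='echo hello', dag=dag)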

Use XCom to exchange data between classes?

邮差的信 submitted on 2019-12-23 15:30:08
Question: I have the following DAG, which executes the different methods of a class dedicated to a data preprocessing routine: from datetime import datetime import os import sys from airflow.models import DAG from airflow.operators.python_operator import PythonOperator import ds_dependencies SCRIPT_PATH = os.getenv('MARKETING_PREPROC_PATH') if SCRIPT_PATH: sys.path.insert(0, SCRIPT_PATH) from table_builder import OnlineOfflinePreprocess else: print('Define MARKETING_PREPROC_PATH value in
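A hedged sketch of the XCom route (the task ids and payload below are placeholders): instead of keeping state on a shared class instance across tasks, each callable pushes its result to XCom and the next one pulls it back through the task instance in the context.

def extract(**context):
    data = {'rows': 42}                                        # placeholder payload
    context['ti'].xcom_push(key='preproc_data', value=data)

def load(**context):
    data = context['ti'].xcom_pull(task_ids='extract_task', key='preproc_data')
    print(data)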

How to pass SQL as file with parameters to Airflow Operator

眉间皱痕 submitted on 2019-12-23 13:39:09
Question: I have an Operator in Airflow: import_orders_op = MySqlToGoogleCloudStorageOperator( task_id='import_orders', mysql_conn_id='con1', google_cloud_storage_conn_id='con2', provide_context=True, sql="""SELECT * FROM orders where orderid>{0}""".format(parameter), bucket=GCS_BUCKET_ID, filename=file_name, dag=dag) Now, the actual query I need to run is 24 lines long. I want to save it in a file and give the operator the path to the SQL file. The operator supports this, but I'm not sure what to do
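A hedged sketch of the file-based approach, reusing names from the snippet above (the directory, file name, and param key are assumptions): sql is a templated field, so it can point to a .sql file resolved through the DAG's template_searchpath, with values injected via params.

dag = DAG('orders_export', default_args=default_args,
          schedule_interval='@daily',
          template_searchpath=['/path/to/sql'])     # directory containing orders.sql

# orders.sql would contain, for example:
#   SELECT * FROM orders WHERE orderid > {{ params.last_order_id }}

import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders',
    mysql_conn_id='con1',
    google_cloud_storage_conn_id='con2',
    sql='orders.sql',                                # rendered as a Jinja template
    params={'last_order_id': 12345},
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)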

Get all Airflow Leaf Nodes/Tasks

我怕爱的太早我们不能终老 submitted on 2019-12-23 10:13:07
Question: I want to build something where I need to capture all of the leaf tasks and add a downstream dependency to them to mark a job complete in our database. Is there an easy way to find all the leaf nodes of a DAG in Airflow? Answer 1: Use the upstream_task_ids and downstream_task_ids @property from BaseOperator: def get_start_tasks(dag: DAG) -> List[BaseOperator]: # returns list of "head" / "root" tasks of DAG return [task for task in dag.tasks if not task.upstream_task_ids] def get_end_tasks(dag: DAG) ->
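A hedged completion of the answer's helpers (same names as the answer; imports assumed for a 1.10-era deployment): a root task has no upstream_task_ids, a leaf task has no downstream_task_ids.

from typing import List
from airflow.models import DAG, BaseOperator

def get_start_tasks(dag: DAG) -> List[BaseOperator]:
    # "root" tasks: nothing upstream of them
    return [task for task in dag.tasks if not task.upstream_task_ids]

def get_end_tasks(dag: DAG) -> List[BaseOperator]:
    # "leaf" tasks: nothing downstream of them
    return [task for task in dag.tasks if not task.downstream_task_ids]

Depending on the Airflow version, DAG also exposes roots and leaves properties, but their semantics have shifted between releases, so check your version's docs before relying on them.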

Airflow kills my tasks after 1 minute

只谈情不闲聊 submitted on 2019-12-23 09:57:15
Question: I have a very simple DAG with two tasks, like the following: default_args = { 'owner': 'me', 'start_date': dt.datetime.today(), 'retries': 0, 'retry_delay': dt.timedelta(minutes=1) } dag = DAG( 'test DAG', default_args=default_args, schedule_interval=None ) t0 = PythonOperator( task_id="task 1", python_callable=run_task_1, op_args=[arg_1, args_2, args_3], dag=dag, execution_timeout=dt.timedelta(minutes=60) ) t1 = PythonOperator( task_id="task 2", python_callable=run_task_2, dag=dag, execution
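Not a diagnosis of the one-minute kill (the cause isn't visible in the truncated snippet), but a hedged rewrite applying the hygiene usually recommended in this situation: a static start_date (datetime.today() changes on every parse) and dag/task ids without spaces, while keeping the explicit execution_timeout.

import datetime as dt
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'me',
    'start_date': dt.datetime(2019, 12, 1),    # fixed date instead of datetime.today()
    'retries': 0,
    'retry_delay': dt.timedelta(minutes=1),
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=None)

t0 = PythonOperator(
    task_id='task_1',
    python_callable=run_task_1,                # defined elsewhere in the asker's file
    op_args=[arg_1, args_2, args_3],           # arguments from the asker's file
    dag=dag,
    execution_timeout=dt.timedelta(minutes=60))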

Airflow: Re-run DAG from beginning with new schedule

十年热恋 submitted on 2019-12-23 09:56:27
Question: Backstory: I was running an Airflow job on a daily schedule, with a start_date of July 1, 2019. The job requested each day's data from a third party, then loaded that data into our database. After running the job successfully for several days, I realized that the third-party data source only refreshed its data once a month. As such, I was simply downloading the same data every day. At that point, I changed the start_date to a year ago (to get previous months' info), and changed the
