airflow

Apache Airflow - get all parent task_ids

与世无争的帅哥 submitted on 2019-12-24 00:15:23
Question: Suppose the following situation: [c1, c2, c3] >> child_task, where c1, c2, c3 and child_task are all operators with task_id equal to id1, id2, id3 and child_id respectively. child_task is also a PythonOperator with provide_context=True and python_callable=dummy_func: def dummy_func(**context): #... Is it possible to retrieve all the parents' ids inside dummy_func (perhaps by browsing the DAG somehow using the context)? The expected result in this case would be the list ['id1', 'id2', 'id3'].
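A minimal sketch of one way to do this (the callable below is illustrative, not the asker's code): inside a callable run with provide_context=True, the context exposes the current operator as context['task'], and its upstream_task_ids property lists the direct parents.

def dummy_func(**context):
    task = context['task']                      # the operator currently executing
    parent_ids = list(task.upstream_task_ids)   # direct parents, e.g. ['id1', 'id2', 'id3'] (order not guaranteed)
    print(parent_ids)
    return parent_ids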

Airflow Get Retry Number

亡梦爱人 submitted on 2019-12-23 18:38:05
Question: In my Airflow DAG I have a task that needs to know whether this is its first run or a retry. I need to adjust the task's logic if it is a retry attempt. I have a few ideas on how I could store the number of retries for the task, but I'm not sure whether any of them are legitimate, or whether there's an easier, built-in way to get this information within the task. I'm wondering if I can just have an integer variable inside the DAG that I increment every time the task runs. Then if the task
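A hedged sketch of the built-in route (the task wiring is assumed): the TaskInstance available in the context carries try_number, so no hand-rolled counter is needed. The exact value seen inside a running task has shifted by one between some Airflow releases, so it is worth printing it once in your environment before relying on it.

def my_task(**context):
    ti = context['ti']                 # current TaskInstance
    if ti.try_number > 1:              # typically 1 on the first attempt, 2+ on retries
        print('retry attempt number: %s' % (ti.try_number - 1))
    else:
        print('first attempt')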

Airflow Python operator passing parameters

孤者浪人 submitted on 2019-12-23 17:27:45
Question: I'm trying to write a Python operator in an Airflow DAG and pass certain parameters to the Python callable. My code looks like the following: def my_sleeping_function(threshold): print(threshold) fmfdependency = PythonOperator( task_id='poke_check', python_callable=my_sleeping_function, provide_context=True, op_kwargs={'threshold': 100}, dag=dag) end = BatchEndOperator( queue=QUEUE, dag=dag) start.set_downstream(fmfdependency) fmfdependency.set_downstream(end) But I keep getting the error below.
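The traceback is cut off above, but a common failure with this exact setup is a TypeError about unexpected keyword arguments: provide_context=True makes Airflow pass the whole template context to the callable as keyword arguments, so the callable must accept them in addition to op_kwargs. A hedged sketch of that fix, reusing the names from the snippet:

def my_sleeping_function(threshold, **context):
    # 'threshold' arrives via op_kwargs; the Airflow context lands in **context
    print(threshold)

fmfdependency = PythonOperator(
    task_id='poke_check',
    python_callable=my_sleeping_function,
    provide_context=True,
    op_kwargs={'threshold': 100},
    dag=dag)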

Scheduling Airflow DAG job

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-23 17:14:15
Question: I have written an Airflow DAG as below - default_args = { 'owner': 'airflow', 'depends_on_past': False, 'start_date': datetime(2016, 7, 5), 'email': ['airflow@airflow.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1, 'retry_delay': timedelta(seconds=30), # 'queue': 'bash_queue', # 'pool': 'backfill', # 'priority_weight': 10, # 'end_date': datetime(2016, 1, 1), } dag = DAG( 'test-air', default_args=default_args, schedule_interval='*/2 * * * *') ................. .........
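For reference, a compact runnable version of the same schedule (the task body is an assumption, not part of the question): with schedule_interval='*/2 * * * *' the scheduler creates a run every two minutes once the start_date has passed, and each run is launched at the end of the interval it covers.

from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 7, 5),
    'retries': 1,
    'retry_delay': timedelta(seconds=30),
}

dag = DAG('test-air', default_args=default_args, schedule_interval='*/2 * * * *')

t1 = BashOperator(task_id='say_hello', bash_command='echo hello', dag=dag)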

Use XCom to exchange data between classes?

邮差的信 submitted on 2019-12-23 15:30:08
Question: I have the following DAG, which executes the different methods of a class dedicated to a data preprocessing routine: from datetime import datetime import os import sys from airflow.models import DAG from airflow.operators.python_operator import PythonOperator import ds_dependencies SCRIPT_PATH = os.getenv('MARKETING_PREPROC_PATH') if SCRIPT_PATH: sys.path.insert(0, SCRIPT_PATH) from table_builder import OnlineOfflinePreprocess else: print('Define MARKETING_PREPROC_PATH value in
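A hedged sketch of the XCom route (the task ids and payload below are placeholders): instead of keeping state on a shared class instance across tasks, each callable pushes its result to XCom and the next one pulls it back through the task instance in the context.

def extract(**context):
    data = {'rows': 42}                                        # placeholder payload
    context['ti'].xcom_push(key='preproc_data', value=data)

def load(**context):
    data = context['ti'].xcom_pull(task_ids='extract_task', key='preproc_data')
    print(data)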

How to pass SQL as file with parameters to Airflow Operator

眉间皱痕 submitted on 2019-12-23 13:39:09
Question: I have an Operator in Airflow: import_orders_op = MySqlToGoogleCloudStorageOperator( task_id='import_orders', mysql_conn_id='con1', google_cloud_storage_conn_id='con2', provide_context=True, sql="""SELECT * FROM orders where orderid>{0}""".format(parameter), bucket=GCS_BUCKET_ID, filename=file_name, dag=dag) Now, the actual query I need to run is 24 lines long. I want to save it in a file and give the operator the path to the SQL file. The operator supports this, but I'm not sure what to do
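A hedged sketch of the file-based approach, reusing names from the snippet above (the directory, file name, and param key are assumptions): sql is a templated field, so it can point to a .sql file resolved through the DAG's template_searchpath, with values injected via params.

dag = DAG('orders_export', default_args=default_args,
          schedule_interval='@daily',
          template_searchpath=['/path/to/sql'])     # directory containing orders.sql

# orders.sql would contain, for example:
#   SELECT * FROM orders WHERE orderid > {{ params.last_order_id }}

import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders',
    mysql_conn_id='con1',
    google_cloud_storage_conn_id='con2',
    sql='orders.sql',                                # rendered as a Jinja template
    params={'last_order_id': 12345},
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)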

Get all Airflow Leaf Nodes/Tasks

我怕爱的太早我们不能终老 submitted on 2019-12-23 10:13:07
Question: I want to build something where I need to capture all of the leaf tasks and add a downstream dependency to them to mark a job complete in our database. Is there an easy way to find all the leaf nodes of a DAG in Airflow? Answer 1: Use the upstream_task_ids and downstream_task_ids @property from BaseOperator: def get_start_tasks(dag: DAG) -> List[BaseOperator]: # returns list of "head" / "root" tasks of DAG return [task for task in dag.tasks if not task.upstream_task_ids] def get_end_tasks(dag: DAG) ->
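A hedged completion of the answer's helpers (same names as the answer; imports assumed for a 1.10-era deployment): a root task has no upstream_task_ids, a leaf task has no downstream_task_ids.

from typing import List
from airflow.models import DAG, BaseOperator

def get_start_tasks(dag: DAG) -> List[BaseOperator]:
    # "root" tasks: nothing upstream of them
    return [task for task in dag.tasks if not task.upstream_task_ids]

def get_end_tasks(dag: DAG) -> List[BaseOperator]:
    # "leaf" tasks: nothing downstream of them
    return [task for task in dag.tasks if not task.downstream_task_ids]

Depending on the Airflow version, DAG also exposes roots and leaves properties, but their semantics have shifted between releases, so check your version's docs before relying on them.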

Airflow kills my tasks after 1 minute

只谈情不闲聊 submitted on 2019-12-23 09:57:15
Question: I have a very simple DAG with two tasks, like the following: default_args = { 'owner': 'me', 'start_date': dt.datetime.today(), 'retries': 0, 'retry_delay': dt.timedelta(minutes=1) } dag = DAG( 'test DAG', default_args=default_args, schedule_interval=None ) t0 = PythonOperator( task_id="task 1", python_callable=run_task_1, op_args=[arg_1, args_2, args_3], dag=dag, execution_timeout=dt.timedelta(minutes=60) ) t1 = PythonOperator( task_id="task 2", python_callable=run_task_2, dag=dag, execution
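Not a diagnosis of the one-minute kill (the cause isn't visible in the truncated snippet), but a hedged rewrite applying the hygiene usually recommended in this situation: a static start_date (datetime.today() changes on every parse) and dag/task ids without spaces, while keeping the explicit execution_timeout.

import datetime as dt
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'me',
    'start_date': dt.datetime(2019, 12, 1),    # fixed date instead of datetime.today()
    'retries': 0,
    'retry_delay': dt.timedelta(minutes=1),
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=None)

t0 = PythonOperator(
    task_id='task_1',
    python_callable=run_task_1,                # defined elsewhere in the asker's file
    op_args=[arg_1, args_2, args_3],           # arguments from the asker's file
    dag=dag,
    execution_timeout=dt.timedelta(minutes=60))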

Airflow: Re-run DAG from beginning with new schedule

十年热恋 submitted on 2019-12-23 09:56:27
Question: Backstory: I was running an Airflow job on a daily schedule, with a start_date of July 1, 2019. The job requested each day's data from a third party, then loaded that data into our database. After running the job successfully for several days, I realized that the third-party data source only refreshed its data once a month. As such, I was simply downloading the same data every day. At that point, I changed the start_date to a year ago (to get previous months' info), and changed the
