Airflow dynamic DAG and Task Ids

Asked by 猫巷女王i on 2020-12-09 03:32

I mostly see Airflow being used for ETL/Big-data-related jobs. I'm trying to use it for business workflows, in which a user action triggers a set of dependent tasks in the future.

2 Answers
  • 2020-12-09 04:06

    After numerous trials and errors, I was able to figure this out. Hopefully it will help someone. Here is how it works: you need an iterator or an external source (a file or a database table) to generate DAGs and tasks dynamically through a template. You can keep the DAG and task names static and just assign them ids dynamically in order to differentiate one DAG from another. You put this Python script in the dags folder. When you start the Airflow scheduler, it runs through this script on every heartbeat and writes the DAGs to the dag table in the database. If a DAG (i.e. a unique dag_id) has already been written, it is simply skipped. The scheduler also looks at the schedule of each individual DAG to determine which is ready for execution; when a DAG is ready, it executes it and updates its status. Here is some sample code:

    # In Airflow 1.x the operators live in their own modules
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator
    from airflow.models import DAG
    from datetime import datetime, timedelta
    import time
    
    # Fallback ids; these are overwritten for every line of the input file below
    dagid = 'DA' + str(int(time.time()))
    taskid = 'TA' + str(int(time.time()))
    
    input_file = '/home/directory/airflow/textfile_for_dagids_and_schedule'
    
    def my_sleeping_function(random_base):
        '''This is a function that will run within the DAG execution'''
        time.sleep(random_base)
    
    def_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': datetime.now(),
        'email_on_failure': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=2)
    }
    with open(input_file, 'r') as f:
        for line in f:
            args = line.strip().split(',')
            # Expect seven fields: id, year, month, day, hour, minute, second
            if len(args) < 7:
                continue
            dagid = 'DAA' + args[0]
            taskid = 'TAA' + args[0]
            yyyy = int(args[1])
            mm = int(args[2])
            dd = int(args[3])
            hh = int(args[4])
            mins = int(args[5])
            ss = int(args[6])
    
            dag = DAG(
                dag_id=dagid, default_args=def_args,
                schedule_interval='@once',
                start_date=datetime(yyyy, mm, dd, hh, mins, ss))
    
            myBashTask = BashOperator(
                task_id=taskid,
                bash_command='python /home/directory/airflow/sendemail.py',
                dag=dag)
    
            task2id = taskid + '-X'
    
            task_sleep = PythonOperator(
                task_id=task2id,
                python_callable=my_sleeping_function,
                op_kwargs={'random_base': 10},
                dag=dag)
    
            # The sleep task runs after the email task
            task_sleep.set_upstream(myBashTask)
    
            # Expose each DAG in the module's global namespace so the
            # scheduler's DagBag can discover it (see the other answer).
            globals()[dagid] = dag
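
    For reference, the loop above assumes each line of the input file contains seven comma-separated fields: an identifier followed by the year, month, day, hour, minute and second of the start date. A file like the one below (the values are purely illustrative, not from the original answer) would yield two one-off DAGs, DAA001 and DAA002:

    001,2020,12,9,10,30,0
    002,2020,12,10,9,0,0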
    
  • 2020-12-09 04:20

    From the Airflow FAQ entry "How can I create DAGs dynamically?":

    Airflow looks in your DAGS_FOLDER for modules that contain DAG objects in their global namespace and adds the objects it finds to the DagBag. Knowing this, all we need is a way to dynamically assign variables in the global namespace, which is easily done in Python using the globals() function from the standard library, which behaves like a simple dictionary.

    from airflow.models import DAG
    
    for i in range(10):
        dag_id = 'foo_{}'.format(i)
        globals()[dag_id] = DAG(dag_id)
        # or better, call a function that returns a DAG object!
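
    As a minimal sketch of that last suggestion (the create_dag helper, its arguments, and the DummyOperator placeholder task are my own illustration, not part of the FAQ):

    from datetime import datetime
    from airflow.models import DAG
    from airflow.operators.dummy_operator import DummyOperator

    def create_dag(dag_id):
        # Hypothetical factory: build one DAG with a single placeholder task.
        dag = DAG(dag_id,
                  start_date=datetime(2020, 12, 9),
                  schedule_interval='@once')
        DummyOperator(task_id='placeholder', dag=dag)
        return dag

    for i in range(10):
        dag_id = 'foo_{}'.format(i)
        # Register each DAG in the module's global namespace so the DagBag picks it up.
        globals()[dag_id] = create_dag(dag_id)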
    