Question
I would like to manage a couple of future releases using Apache Airflow. All of these releases are known well in advance, and I need to make sure some data pushing won't be forgotten.
The problem is that those future releases do not follow a simple periodic schedule that could be handled with a classic cron expression like 0 1 23 * * or a preset such as @monthly. The dates are rather 2019-08-24, 2019-09-30, 2019-10-20, and so on.
Is there another way than creating a separate mydag.py file for each of those future releases? What is the standard / recommended way to do this? Am I thinking about this the wrong way? (I wonder because the documentation and tutorials rather focus on regular, periodic schedules.)
Answer 1:
I can think of two simple ways of doing this:
1. Create 3-4 top-level DAGs, each having a specific start_date (2019-08-24, 2019-09-30, ...) and schedule_interval='@once'.
2. Create a single top-level DAG having schedule_interval=None (start_date can be anything). Then create a "triggering DAG" that employs TriggerDagRunOperator to conditionally trigger your actual workflow on the specific dates.
Clearly method 2 above is better.
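For method 2, a minimal sketch of the conditional callable, assuming the Airflow 1.10 TriggerDagRunOperator contract: python_callable receives the context and a dag_run_obj, and returns the latter to fire the target DAG or None to do nothing.

```python
# Release dates on which the target DAG should actually be triggered
RELEASE_DATES = {'2019-08-24', '2019-09-30', '2019-10-20'}

def conditionally_trigger(context, dag_run_obj):
    # Airflow 1.10 contract: return dag_run_obj to trigger, None to skip
    if context['ds'] in RELEASE_DATES:
        return dag_run_obj
    return None
```

The triggering DAG itself would run @daily and pass this function as python_callable to a TriggerDagRunOperator whose trigger_dag_id points at your release workflow (the DAG and task names are up to you).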
Answer 2:
You could give your DAG a @daily schedule, then start it with a ShortCircuitOperator task that checks whether the execution date matches a release date. If it does, the check passes and the rest of the DAG runs; otherwise the remaining tasks are skipped and no release happens. See an example of this operator being used in https://github.com/apache/airflow/blob/1.10.3/airflow/example_dags/example_short_circuit_operator.py.
I imagine it'd look something like this:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import ShortCircuitOperator

RELEASE_DATES = ['2019-08-24', '2019-09-30', '2019-10-20']

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 8, 1),
}

dag = DAG(
    dag_id='my_dag',
    schedule_interval='@daily',
    default_args=default_args,
)

def check_release_date(**context):
    # Continue only if the execution date ('ds') is a release day;
    # ShortCircuitOperator skips all downstream tasks when this returns False.
    return context['ds'] in RELEASE_DATES

skip_if_not_release_date = ShortCircuitOperator(
    task_id='skip_if_not_release_date',
    python_callable=check_release_date,
    provide_context=True,
    dag=dag,
)
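Since the short-circuit callable is plain Python, it can be sanity-checked outside Airflow before deploying (ds is the execution-date string Airflow injects into the context; the snippet below just replays the same membership test):

```python
RELEASE_DATES = ['2019-08-24', '2019-09-30', '2019-10-20']

def check_release_date(**context):
    # Same membership test the ShortCircuitOperator will run inside Airflow
    return context['ds'] in RELEASE_DATES

print(check_release_date(ds='2019-09-30'))  # True: release day, DAG continues
print(check_release_date(ds='2019-09-29'))  # False: downstream tasks skipped
```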
If release dates can change, then you might want to make this a little more dynamic with variables to make updates easy.
from airflow.models import Variable

def check_release_date(**context):
    # 'release_dates' is an Airflow Variable holding a JSON list of date strings
    release_dates = Variable.get('release_dates', deserialize_json=True)
    return context['ds'] in release_dates
Also, if for whatever reason you need to override your hardcoded list of release dates, you can mark this task as success to force the DAG to run.
Source: https://stackoverflow.com/questions/57226707/execute-airflow-dag-instances-tasks-on-a-list-of-specific-dates