Airflow starts two DAG runs when turned on for the first time

Submitted by 烂漫一生 on 2019-12-23 04:04:53

Question


When I boot up the Airflow webserver and scheduler for the first time on Oct 25th at around 17:23, and turn on my DAG, I can see that it kicks off two runs for Oct 23rd and Oct 24th:

RUN 1 -> 10-23T17:23
RUN 2 -> 10-24T17:23

Here's my DAG configuration:

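import datetime

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': '2019-01-01',
    'retries': 0,
}
dag = DAG(
    'my_script',
    default_args=default_args,
    # Run once per day
    schedule_interval=datetime.timedelta(days=1),
    # Do not backfill the intervals missed since start_date
    catchup=False,
)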

Since it's past the start_date + schedule_interval and I have set catchup=False, I would expect it to kick off a single DAG run immediately. I would not expect it to kick off two.

  • Why are two DAG runs being executed?
  • How can I prevent this behaviour?

Answer 1:


I am not sure, but this is my best guess.

In short, this may simply be how Airflow is built, and a workaround would be to change your start_date to yesterday.

Long version

I agree that kicking off a single DAG run for 10-24 when you turn the DAG on would seem more natural.

However, according to your DAG runs, RUN 1 is for 10-23. This suggests to me that the first run is not being initialized correctly, so I looked into the scheduler code.

I have doubts about this line:

https://github.com/apache/airflow/blob/68b8ec5f415795e4fa4ff7df35a3e75c712a7bad/airflow/jobs/scheduler_job.py#L603

This is inside the function that creates a DAG run and sets the start date of the run.

# The logic is that we move start_date up until
# one period before, so that timezone.utcnow() is AFTER
# the period end, and the job can be created...
now = timezone.utcnow()

# This returns the current time + schedule_interval. In your example, that is tomorrow.
next_start = dag.following_schedule(now)

# This returns the current time - schedule_interval. In your example, that is yesterday.
last_start = dag.previous_schedule(now)

# tomorrow <= now is False, so the else branch runs
if next_start <= now:
    new_start = last_start
else:
    # This returns last_start - schedule_interval, which is 2 days ago.
    # I wonder whether this was intended to be dag.previous_schedule(next_start)?
    new_start = dag.previous_schedule(last_start)
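
To check the arithmetic, here is a minimal sketch that replays the logic above with plain datetime objects instead of Airflow. It is not the scheduler's actual implementation. The helper names effective_start and runs_due are hypothetical. It assumes a timedelta schedule, naive datetimes, that the scheduler keeps the later of start_date and new_start, and that a run for an execution date is only created once its whole interval has passed.

```python
import datetime


def effective_start(now, start_date, interval):
    """Replay the catchup=False start-date adjustment (timedelta schedules only)."""
    next_start = now + interval  # stands in for dag.following_schedule(now)
    last_start = now - interval  # stands in for dag.previous_schedule(now)
    if next_start <= now:
        new_start = last_start
    else:
        new_start = last_start - interval  # dag.previous_schedule(last_start)
    # Assumption: the later of the configured start_date and new_start wins.
    return max(start_date, new_start)


def runs_due(now, start_date, interval):
    """List execution dates whose interval has fully ended by `now`.

    Assumption: a run for date D is created only once D + interval <= now.
    """
    runs = []
    execution_date = effective_start(now, start_date, interval)
    while execution_date + interval <= now:
        runs.append(execution_date)
        execution_date += interval
    return runs


now = datetime.datetime(2019, 10, 25, 17, 23)
one_day = datetime.timedelta(days=1)

# start_date far in the past, as in the question
original = runs_due(now, datetime.datetime(2019, 1, 1), one_day)
# start_date set to yesterday, as in the suggested workaround
workaround = runs_due(now, datetime.datetime(2019, 10, 24), one_day)

print([d.isoformat() for d in original])
print([d.isoformat() for d in workaround])
```

Under these assumptions, the start_date from the question yields runs for 10-23T17:23 and 10-24T17:23, which matches what was observed, while a start_date of yesterday at midnight yields a single run for 10-24.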


Source: https://stackoverflow.com/questions/58563313/airflow-starts-two-dag-runs-when-turned-on-for-the-first-time
