airflow

How to set dependencies between DAGs in Airflow?

安稳与你 submitted on 2019-11-27 00:21:59
Question: I am using Airflow to schedule batch jobs. I have one DAG (A) that runs every night and another DAG (B) that runs once per month. B depends on A having completed successfully, but B takes a long time to run, so I would like to keep it in a separate DAG to allow better SLA reporting. How can I make a run of DAG B depend on a successful run of DAG A on the same day?

Answer 1: You can achieve this behavior using an operator called ExternalTaskSensor. Your task (B1) in DAG (B) will be …
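Below is a minimal sketch of the approach the answer names (ExternalTaskSensor); the DAG ids, task ids, and import path (Airflow 1.10.x) are assumptions, not taken from the original post.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
# In Airflow 1.9 the sensor lives under airflow.operators.sensors instead.
from airflow.sensors.external_task_sensor import ExternalTaskSensor

dag_b = DAG("dag_b", start_date=datetime(2019, 1, 1), schedule_interval="@monthly")

# Wait for a task in DAG A on the matching execution_date before running B's work.
wait_for_a = ExternalTaskSensor(
    task_id="wait_for_dag_a",
    external_dag_id="dag_a",
    external_task_id="final_task_in_a",  # hypothetical last task of DAG A
    # execution_delta or execution_date_fn can map B's monthly schedule onto A's
    # daily execution dates when the two schedules differ.
    dag=dag_b,
)

b1 = DummyOperator(task_id="b1", dag=dag_b)
wait_for_a >> b1
```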

Airflow parallelism

删除回忆录丶 submitted on 2019-11-27 00:06:50
Question: The LocalExecutor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates? I need to change it. What is the difference between the scheduler's "max_threads" and "parallelism" in airflow.cfg?

Answer 1: parallelism: not a very descriptive name. The description says it sets the maximum number of task instances for the Airflow installation, which is a bit ambiguous — if I have two hosts running Airflow workers, I'd have Airflow installed on two hosts, …
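Both settings discussed in the answer are global airflow.cfg values. As a companion, here is a minimal sketch (not from the answer) of the per-DAG limits that interact with them in Airflow 1.x; the dag_id is hypothetical.

```python
from datetime import datetime

from airflow import DAG

dag = DAG(
    "parallelism_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    concurrency=4,       # max task instances running for this DAG across all runs
    max_active_runs=1,   # max simultaneous DAG runs for this DAG
)
# Globally, [core] parallelism caps running task instances for the whole installation,
# while [scheduler] max_threads controls how many processes the scheduler uses to
# parse and schedule DAG files.
```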

Airflow default on_failure_callback

╄→尐↘猪︶ㄣ submitted on 2019-11-26 23:03:02
Question: In my DAG file, I have defined an on_failure_callback() function to post a Slack message in case of failure. It works well if I specify it for each operator in my DAG: on_failure_callback=on_failure_callback. Is there a way to automate the dispatch to all of my operators (via default_args for instance, or via my DAG object)?

Answer 1: I finally found a way to do that. You can pass your on_failure_callback as a default_args entry: class Foo: @staticmethod def get_default_args(): """ Return default args :return: …
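A minimal sketch of the idea in the answer (callback passed through default_args), assuming a trivial callback body; the Slack-posting logic and all identifiers here are placeholders, not the poster's code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


def on_failure_callback(context):
    # The failure context carries the task instance, the exception, etc.
    print("Task failed:", context["task_instance"].task_id)


default_args = {
    "owner": "airflow",
    # Applied to every operator created with this DAG's default_args.
    "on_failure_callback": on_failure_callback,
}

dag = DAG(
    "failure_callback_example",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Inherits on_failure_callback without specifying it explicitly.
task = DummyOperator(task_id="might_fail", dag=dag)
```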

How to dynamically iterate over the output of an upstream task to create parallel tasks in airflow?

人盡茶涼 submitted on 2019-11-26 20:45:51
Question: Consider the following example of a DAG where the first task, get_id_creds, extracts a list of credentials from a database. This operation tells me which users in my database I am able to run further data preprocessing on, and it writes those ids to the file /tmp/ids.txt. I then scan those ids into my DAG and use them to generate a list of upload_transaction tasks that can be run in parallel. My question is: is there a more idiomatically correct, dynamic way to do this using Airflow? What I …
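A minimal sketch of the pattern the question describes (generating one task per id at DAG-parse time); the file path comes from the question, while the DAG id, callable, and task naming are hypothetical.

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def upload_transaction(user_id):
    print("uploading transactions for", user_id)  # placeholder body


dag = DAG("dynamic_upload_example", start_date=datetime(2019, 1, 1), schedule_interval=None)

# Read the ids produced by the upstream task, if the file already exists on this host.
ids = []
if os.path.exists("/tmp/ids.txt"):
    with open("/tmp/ids.txt") as f:
        ids = [line.strip() for line in f if line.strip()]

# One parallel task per id; the task list changes whenever the file changes.
for user_id in ids:
    PythonOperator(
        task_id="upload_transaction_{}".format(user_id),
        python_callable=upload_transaction,
        op_kwargs={"user_id": user_id},
        dag=dag,
    )
```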

How to prevent airflow from backfilling dag runs?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-26 20:31:27
Question: Say you have an Airflow DAG that doesn't make sense to backfill, meaning that, after it has run once, running it additional times in quick succession would be completely pointless. For example, if you're loading data into your database from a source that is only updated hourly, backfilling (which occurs in rapid succession) would just import the same data again and again. This is especially annoying when you instantiate a new hourly task and it runs N times for each hour it missed, …
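A minimal sketch of the usual way to disable this behavior per DAG (catchup=False); the dag_id is hypothetical, and the same default can be set installation-wide with catchup_by_default in airflow.cfg.

```python
from datetime import datetime, timedelta

from airflow import DAG

# catchup=False tells the scheduler to skip the missed past intervals and only
# create a DAG run for the most recent schedule interval.
dag = DAG(
    dag_id="hourly_load_no_backfill",
    start_date=datetime(2019, 1, 1),
    schedule_interval=timedelta(hours=1),
    catchup=False,
)
```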

Why is it recommended against using a dynamic start_date in Airflow?

被刻印的时光 ゝ submitted on 2019-11-26 17:24:16
Question: I've read Airflow's FAQ entry "What's the deal with start_date?", but it still isn't clear to me why a dynamic start_date is recommended against. To my understanding, a DAG's execution_date is determined by the minimum start_date among all of the DAG's tasks, and subsequent DAG runs are run at the latest execution_date + schedule_interval. If I set my DAG's default_args start_date to, say, yesterday at 20:00:00, with a schedule_interval of 1 day, how would that break or …
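A minimal sketch (not from the post) contrasting the two patterns the question asks about; the dag_ids are hypothetical and the fixed date is an arbitrary example.

```python
from datetime import datetime

from airflow import DAG

# Recommended: a fixed, static start_date that is identical every time the
# DAG file is parsed.
static_dag = DAG(
    "static_start_example",
    start_date=datetime(2019, 11, 25, 20, 0, 0),
    schedule_interval="@daily",
)

# Discouraged: a start_date evaluated at parse time, e.g. datetime.now().
# Because the file is re-parsed continuously, start_date keeps moving forward,
# so start_date + schedule_interval may never be reached and runs can be skipped.
dynamic_dag = DAG(
    "dynamic_start_example",
    start_date=datetime.now(),
    schedule_interval="@daily",
)
```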

Airflow “This DAG isn't available in the webserver DagBag object”

戏子无情 submitted on 2019-11-26 16:23:34
Question: When I put a new DAG Python script in the dags folder, I can see a new DAG entry in the DAG UI, but it is not enabled automatically. On top of that, it does not seem to be loaded properly either. I have to click the Refresh button a few times on the right side of the list and toggle the on/off button on the left side of the list before I am able to schedule the DAG. These are manual steps; I have to trigger something even though the DAG script was put inside the dags folder. Can anyone help me …
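One quick way (not from the original thread) to check whether the scheduler and webserver can actually import a DAG file is to build a DagBag by hand and inspect its import errors; the dag_id below is a placeholder.

```python
from airflow.models import DagBag

dag_bag = DagBag()  # parses every file under the configured dags_folder

# Any file that fails to import is listed here with its error message.
print(dag_bag.import_errors)

# The DAG only shows up (and can be scheduled) if it was parsed successfully.
print("my_new_dag" in dag_bag.dags)
```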

Airflow 1.9.0 is queuing but not launching tasks

时间秒杀一切 submitted on 2019-11-26 16:19:20
Airflow is randomly not running queued tasks; some tasks don't even get queued status. I keep seeing the following in the scheduler logs: [2018-02-28 02:24:58,780] {jobs.py:1077} INFO - No tasks to consider for execution. I do see tasks in the database that have either no status or queued status, but they never get started. The Airflow setup is running https://github.com/puckel/docker-airflow on ECS with Redis. There are 4 scheduler threads and 4 Celery worker tasks. The tasks that are not running show up in the queued state (grey icon); when hovering over the task icon, the operator is null and the task details …
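As a diagnostic aid (not from the original post), the stuck task instances described above can be listed directly through Airflow's own ORM session; this is only an inspection sketch, not a fix.

```python
from airflow import settings
from airflow.models import TaskInstance
from airflow.utils.state import State

session = settings.Session()

# Task instances sitting in the QUEUED state that the executor never picked up.
queued = session.query(TaskInstance).filter(TaskInstance.state == State.QUEUED).all()
for ti in queued:
    print(ti.dag_id, ti.task_id, ti.execution_date, ti.state)

session.close()
```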

Airflow installation steps

我怕爱的太早我们不能终老 submitted on 2019-11-26 12:12:15
Because of various version problems with Python 2.7, Python 3.6 was used in the end.
1. Download Anaconda3.
2. Create a virtual environment with conda.
3. Deploy the demo following Airflow's official Quick Start documentation. Doc URL: http://airflow.apache.org/start.html -- then switch the metadata store from SQLite to MySQL.
4. Install MySQL. Doc URL: https://dinfratechsource.com/2018/11/10/how-to-install-latest-mysql-5-7-21-on-rhel-centos-7/
5. Configure MySQL: a. under [mysqld] in my.cnf, add explicit_defaults_for_timestamp=1; b. create an airflow database in MySQL.
6. Modify Airflow's airflow.cfg file, following the official "Initializing a Database Backend" documentation. Doc URL: http://airflow.apache.org/howto/initialize-database.html. This mainly involves two points: a. change executor in airflow.cfg to LocalExecutor; b. change the sql …
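The steps above edit airflow.cfg directly. As an alternative sketch (not part of the original notes), the same settings can be supplied through AIRFLOW__<SECTION>__<KEY> environment variables, which override airflow.cfg values in Airflow 1.x. The MySQL host, user, password, and database name below are placeholders, and the truncated setting in step 6b is assumed to be the SQLAlchemy connection string.

```python
import os

# These would need to be present in the environment of every Airflow process
# (scheduler, webserver, workers), e.g. exported before launching them.
os.environ["AIRFLOW__CORE__EXECUTOR"] = "LocalExecutor"
os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = (
    "mysql://airflow_user:airflow_pass@localhost:3306/airflow"
)
```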

Create and use Connections in Airflow operator at runtime [duplicate]

这一生的挚爱 submitted on 2019-11-26 10:01:05
Question: This question already has an answer here: "Is there a way to create/modify connections through Airflow API" (2 answers). Note: This is NOT a duplicate of "Export environment variables at runtime with airflow" / "Set Airflow Env Vars at Runtime". I have to trigger certain tasks on remote systems from my Airflow DAG. The straightforward way to achieve this is SSHHook. The problem is that the remote system is an EMR cluster which is itself created at runtime (by an upstream task) using …
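A minimal sketch of one common workaround (not necessarily the answer in the linked duplicate): create the Connection row programmatically once the EMR master address is known, so a later SSHHook can reference it by conn_id. The conn_id, login, and host handling here are placeholders.

```python
from airflow import settings
from airflow.models import Connection


def register_emr_ssh_connection(emr_master_dns):
    """Store an SSH connection for the freshly created EMR cluster."""
    conn = Connection(
        conn_id="emr_ssh_runtime",  # hypothetical conn_id later passed to SSHHook
        conn_type="ssh",
        host=emr_master_dns,
        login="hadoop",
        port=22,
    )
    session = settings.Session()
    session.add(conn)
    session.commit()
    session.close()
```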