airflow

How to run airflow scheduler as a daemon process?

99封情书 submitted on 2019-12-04 08:40:39
I am new to Airflow. I am trying to run the airflow scheduler as a daemon process, but the process does not stay alive for long. I have configured "LocalExecutor" in the airflow.cfg file and ran the following command to start the scheduler (I am using Google Compute Engine and accessing the server via PuTTY): airflow scheduler --daemon --num_runs=5 --log-file=/root/airflow/logs/scheduler.log When I run this command, the airflow scheduler starts and I can see the airflow-scheduler.pid file in my airflow home folder, but the process does not live for long. When I close the PuTTY session and reconnect to the server

Airflow web task management

百般思念 submitted on 2019-12-04 08:09:32
In the Airflow UI, green means the runs have all completed, while red indicates a problem. Click a red icon to drill in, then open the tree view: a red square means the run for that day failed. Click into it to view the log, and click "clear" to re-run the task. Source: https://www.cnblogs.com/hongfeng2019/p/11847600.html

How Airflow works

吃可爱长大的小学妹 submitted on 2019-12-04 08:06:57
Official site: http://airflow.apache.org/installation.html How it works: https://www.cnblogs.com/cord/p/9450910.html Overview: Airflow's daemon processes. A running Airflow system consists of several daemon processes that together provide all of Airflow's functionality, including the web server (webserver), the scheduler, the workers, and the message-queue monitoring tool Flower. These are the main daemons in an apache-airflow cluster or high-availability deployment. webserver: a daemon that accepts HTTP requests and lets you interact with Airflow through a Python Flask web application. It provides the following functionality: stopping, resuming and triggering tasks; monitoring running tasks and resuming interrupted runs; running ad-hoc commands or SQL statements to query task status, logs and other details; and configuring connections, including but not limited to database and SSH connections. The webserver daemon uses the gunicorn server (roughly the counterpart of Tomcat in the Java world) to handle concurrent requests; the number of processes handling those requests can be controlled by changing the workers value in {AIRFLOW_HOME}/airflow.cfg, for example: workers = 4

Creating connection outside of Airflow GUI

纵饮孤独 submitted on 2019-12-04 06:22:06
I would like to create an S3 connection without interacting with the Airflow GUI. Is it possible through airflow.cfg or the command line? We are using an AWS role, and the following connection parameters work for us: {"aws_account_id":"xxxx","role_arn":"yyyyy"} Manually creating the connection for S3 in the GUI works, so now we want to automate this and make it part of the Airflow deployment process. Any workaround? You can use the airflow CLI. Unfortunately there is no support for editing connections, so you would have to remove and re-add them as part of your deployment process, e.g.: airflow connections
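For illustration, a minimal sketch of one way to automate this, assuming Airflow 1.x-style imports and direct access to the metadata database; the connection id and the values in the extra JSON are placeholders, not taken from the original question:

# Sketch: register an S3 connection programmatically as part of a deployment,
# instead of creating it in the GUI. Assumes the host running this can reach
# Airflow's metadata database; conn_id and the extra JSON are placeholders.
from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id="s3_conn",  # hypothetical connection id
    conn_type="s3",
    extra='{"aws_account_id": "xxxx", "role_arn": "yyyyy"}',
)

session = settings.Session()
# Only insert the connection if it does not already exist, so redeploys stay idempotent.
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
session.close()

Connections can also be supplied without touching the database at all, via an AIRFLOW_CONN_<CONN_ID> environment variable holding a connection URI, which is another common way to keep them out of the GUI.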

Airflow Jinja Rendered Template

試著忘記壹切 submitted on 2019-12-04 05:58:33
I've been able to successfully render Jinja templates using the BaseOperator function render_template. My question is: does anyone know the requirements for getting rendered strings into the UI under the Rendered or Rendered Template tab? Referring to this tab in the UI: Any help or guidance here would be appreciated. If you are using templated fields in an operator, the strings created from those templated fields will be shown there. E.g. with a BashOperator: example_task = BashOperator( task_id='task_example_task', bash_command='mycommand --date {{ task_instance.execution_date }}',
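To make the truncated snippet above concrete, here is a self-contained sketch of such a DAG file (the DAG id, schedule and start date are assumed, not from the original answer); any field listed in an operator's template_fields, such as bash_command on BashOperator, is rendered by Jinja for each task instance and shows up under the Rendered tab:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

dag = DAG(
    dag_id="rendered_template_example",  # hypothetical dag id
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

example_task = BashOperator(
    task_id="task_example_task",
    # bash_command is in BashOperator.template_fields, so this string is rendered
    # by Jinja for each run and the result appears in the UI's Rendered tab.
    bash_command="mycommand --date {{ task_instance.execution_date }}",
    dag=dag,
)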

Run parallel tasks in Apache Airflow

戏子无情 submitted on 2019-12-04 04:32:19
I am able to configure the airflow.cfg file to run tasks one after the other. What I want to do is execute tasks in parallel, e.g. 2 at a time, until the end of the list is reached. How can I configure this? Taylor Edmiston: Executing tasks in parallel in Airflow depends on which executor you're using, e.g. SequentialExecutor, LocalExecutor, CeleryExecutor, etc. For a simple setup, you can achieve parallelism by just setting your executor to LocalExecutor in your airflow.cfg: [core] executor = LocalExecutor Reference: https://github.com/apache/incubator-airflow/blob
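As an illustration (the DAG and task names here are made up, not from the answer), a DAG whose two middle tasks have no dependency on each other; with LocalExecutor and the default parallelism settings the scheduler can run them at the same time:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.x import path

dag = DAG(
    dag_id="parallel_example",  # hypothetical dag id
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

start = DummyOperator(task_id="start", dag=dag)
task_a = DummyOperator(task_id="task_a", dag=dag)
task_b = DummyOperator(task_id="task_b", dag=dag)
end = DummyOperator(task_id="end", dag=dag)

# task_a and task_b depend only on start, so the executor is free to run them
# in parallel; end waits for both.
start >> task_a >> end
start >> task_b >> end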

In airflow, is there a good way to call another dag's task?

試著忘記壹切 submitted on 2019-12-04 04:27:32
Question: I've got dag_prime and dag_tertiary. dag_prime: scans through a directory and intends to call dag_tertiary on each entry; currently a PythonOperator. dag_tertiary: scans through the directory passed to it and does (possibly time-intensive) calculations on the contents thereof. I can call the secondary one via a system call from the Python operator, but I feel like there's got to be a better way. I'd also like to consider queuing the dag_tertiary calls, if there's a simple way to do that. Is

Scheduling dag runs in Airflow

﹥>﹥吖頭↗ submitted on 2019-12-04 03:46:54
Question: I have a general query on Airflow: is it possible to have one DAG scheduled based on another DAG's schedule? For example, if I have two DAGs, dag1 and dag2, I am trying to see if I can have dag2 run each time dag1 succeeds, and otherwise not run. Is this possible in Airflow? Answer 1: You will want to add a TriggerDagRunOperator at the end of dag1 and set the schedule of dag2 to None. In addition, if you want to handle multiple cases for the output of dag1, you can add in a
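A minimal sketch of that suggestion, assuming Airflow 1.x import paths and made-up DAG ids: dag1 ends with a task that triggers dag2, and dag2 has no schedule of its own.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator  # Airflow 1.x import path

dag1 = DAG(
    dag_id="dag1",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

do_work = DummyOperator(task_id="do_work", dag=dag1)

# Last task in dag1: when it runs, it creates a DagRun for dag2.
trigger_dag2 = TriggerDagRunOperator(
    task_id="trigger_dag2",
    trigger_dag_id="dag2",
    dag=dag1,
)

do_work >> trigger_dag2

# dag2 is never scheduled on its own; it only runs when dag1 triggers it.
dag2 = DAG(
    dag_id="dag2",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

downstream_work = DummyOperator(task_id="downstream_work", dag=dag2)

Because the trigger task only runs when its upstream tasks succeed (the default all_success trigger rule), dag2 is effectively triggered only on a successful dag1 run.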

Airflow: dag_id could not be found

谁说我不能喝 submitted on 2019-12-04 03:31:10
I'm running an airflow server and worker on different AWS machines. I've synced the dags folder between them, run airflow initdb on both, and checked that the dag_ids are the same when I run airflow list_tasks <dag_id>. When I run the scheduler and worker, I get this error on the worker: airflow.exceptions.AirflowException: dag_id could not be found: . Either the dag did not exist or it failed to parse. [...] Command ...--local -sd /home/ubuntu/airflow/dags/airflow_tutorial.py' What seems to be the problem is that the path there (/home/ubuntu/airflow/dags/airflow_tutorial.py) is wrong, since

Airflow: Tasks queued but not running

孤街浪徒 submitted on 2019-12-04 03:17:47
Question: I am new to Airflow and trying to set it up to run ETL pipelines. I was able to install airflow, postgres, celery and rabbitmq, and I can test-run the tutorial DAG. When I try to schedule the jobs, the scheduler picks them up and queues them, which I can see in the UI, but the tasks are not running. Could somebody help me fix this issue? I believe I am missing a basic Airflow concept here. Here is my airflow.cfg: [core] airflow_home = /root/airflow dags_folder =