airflow

Proper way to create dynamic workflows in Airflow

↘锁芯ラ submitted on 2019-11-28 02:59:05
Problem: Is there any way in Airflow to create a workflow such that the number of tasks B.* is unknown until completion of Task A? I have looked at subdags, but it looks like they can only work with a static set of tasks that has to be determined at DAG creation. Would DAG triggers work? And if so, could you please provide an example? I have an issue where it is impossible to know the number of Task B's that will be needed to calculate Task C until Task A has been completed. Each Task B.* will take several hours to compute and cannot be combined. The intended shape is Task A fanning out to Task B.1, Task B.2, and so on, which then all feed Task C.
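One workaround that often comes up (a sketch only, not the poster's code): have Task A persist the number of B tasks it needs, for example in an Airflow Variable, and let the DAG file generate the B.* tasks from that value the next time the scheduler parses it. All names and the count logic below are hypothetical.

```python
# Sketch: Task A stores how many B tasks are needed; the DAG file reads that
# value at parse time and fans out. Names and counts are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG('dynamic_b_tasks', start_date=datetime(2019, 1, 1), schedule_interval=None)

def run_task_a(**kwargs):
    # Task A decides how many B tasks are needed and stores the count.
    num_b = 5  # placeholder for the real computation
    Variable.set('num_b_tasks', num_b)

task_a = PythonOperator(task_id='task_a', python_callable=run_task_a, dag=dag)
task_c = DummyOperator(task_id='task_c', dag=dag)

# The scheduler re-parses this file regularly, so the fan-out reflects the
# value written by a *previous* run of Task A.
for i in range(int(Variable.get('num_b_tasks', default_var=0))):
    task_b = PythonOperator(
        task_id='task_b_%d' % i,
        python_callable=lambda: None,
        dag=dag)
    task_a >> task_b >> task_c
```

The obvious caveat is that the fan-out lags one run behind Task A, which is exactly why the question is looking for something cleaner.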

How to handle different task intervals on a single DAG in Airflow?

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-28 01:17:19
Problem: I have a single DAG with multiple tasks in this simple structure: tasks A, B, and C can run at the start without any dependencies, but task D depends on A. Now here is my question: tasks A, B, and C run daily, but I need task D to run weekly, after A succeeds. How can I set up this DAG? Does changing the schedule_interval of a task work? Is there any best practice for this problem? Thanks for your help. Answer 1: You can use a ShortCircuitOperator to do this. import airflow from airflow.operators.python
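A minimal sketch of the ShortCircuitOperator approach from the truncated answer, assuming task D should only run on Mondays; the task names and the weekday check are placeholders, not from the original post.

```python
# Sketch: the DAG stays daily, but a gating task lets D proceed only on the
# chosen weekday; on other days the downstream task is skipped.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import ShortCircuitOperator

dag = DAG('mixed_intervals', start_date=datetime(2019, 1, 1), schedule_interval='@daily')

task_a = DummyOperator(task_id='task_a', dag=dag)
task_d = DummyOperator(task_id='task_d', dag=dag)

# Returns True only on Mondays; otherwise downstream tasks are skipped.
only_on_monday = ShortCircuitOperator(
    task_id='only_on_monday',
    python_callable=lambda execution_date, **_: execution_date.weekday() == 0,
    provide_context=True,
    dag=dag)

task_a >> only_on_monday >> task_d
```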

Installing and deploying Airflow on CentOS 7

青春壹個敷衍的年華 submitted on 2019-11-28 00:56:18
I deployed in a CentOS 7 environment; how to configure a static IP, disable the firewall, and install the JDK on CentOS 7 is not covered here. For configuring a static IP on CentOS 7, see: https://www.cnblogs.com/braveym/p/8523100.html and https://www.cnblogs.com/braveym/p/9096402.html Basic Airflow installation: 1. The system ships with Python 2 by default; install pip yourself: sudo yum -y install epel-release, sudo yum install python-pip 2. Upgrade pip, otherwise many installs will fail: sudo pip install --upgrade pip, sudo pip install --upgrade setuptools 3. Install the development libraries: sudo yum install python-devel, sudo yum install libevent-devel, sudo yum install mysql-devel 4. Install MySQL on CentOS 7. Before installing MySQL it is recommended to update the repository sources first, otherwise the download keeps failing. Open the CentOS yum folder (root privileges or write access to that directory are required), enter the command cd /etc/yum.repos.d, then download the repo file with wget: enter the command sudo wget

BashOperator doesn't run bash file in Apache Airflow

☆樱花仙子☆ submitted on 2019-11-27 22:20:16
Problem: I just started using Apache Airflow. I am trying to run a test.sh file from Airflow; however, it does not work. Following is my code; the file name is test.py: import os from airflow import DAG from airflow.operators.bash_operator import BashOperator from datetime import datetime, timedelta default_args = { 'owner': 'airflow', 'depends_on_past': False, 'start_date': datetime(2015, 6, 1), 'email': ['airflow@airflow.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1, 'retry_delay':
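For context, one frequent cause with BashOperator and .sh files is that a bash_command ending in ".sh" is treated as a Jinja template file to be resolved relative to the DAG folder; a commonly cited workaround is a trailing space after the script path (or invoking bash explicitly with an absolute path). A hedged sketch, with a hypothetical script path:

```python
# Sketch: run an external shell script from a BashOperator.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('run_shell_script', start_date=datetime(2015, 6, 1), schedule_interval=None)

run_test_sh = BashOperator(
    task_id='run_test_sh',
    # Note the trailing space: without it, the ".sh" suffix makes Airflow try
    # to render the path as a Jinja template file. Path is hypothetical.
    bash_command='/home/airflow/scripts/test.sh ',
    dag=dag)
```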

How to prevent airflow from backfilling dag runs?

假装没事ソ submitted on 2019-11-27 21:48:08
Say you have an Airflow DAG that doesn't make sense to backfill, meaning that, after it's run once, running it again quickly would be completely pointless. For example, if you're loading data from some source that is only updated hourly into your database, backfilling, which occurs in rapid succession, would just be importing the same data again and again. This is especially annoying when you instantiate a new hourly task and it runs N times, once for each hour it missed, doing redundant work, before it starts running on the interval you specified. The only solution I can think
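For reference, the usual answers involve catchup=False on the DAG (or catchup_by_default = False in airflow.cfg) and, optionally, a LatestOnlyOperator so that only the most recent run does real work. A sketch under those assumptions, with an illustrative DAG name:

```python
# Sketch: disable backfilling for an hourly DAG and skip non-latest runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.latest_only_operator import LatestOnlyOperator

dag = DAG(
    'hourly_no_backfill',
    start_date=datetime(2019, 1, 1),
    schedule_interval='@hourly',
    catchup=False)  # do not create runs for missed intervals

latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
load_data = DummyOperator(task_id='load_data', dag=dag)

latest_only >> load_data
```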

How to dynamically iterate over the output of an upstream task to create parallel tasks in airflow?

浪子不回头ぞ submitted on 2019-11-27 21:41:28
Consider the following example of a DAG where the first task, get_id_creds , extracts a list of credentials from a database. This operation tells me what users in my database I am able to run further data preprocessing on and it writes those ids to the file /tmp/ids.txt . I then scan those ids into my DAG and use them to generate a list of upload_transaction tasks that can be run in parallel. My question is: Is there a more idiomatically correct, dynamic way to do this using airflow? What I have here feels clumsy and brittle. How can I directly pass a list of valid IDs from one process to that
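A sketch of the file-based pattern the question describes, just to make the setup concrete; the helper functions and ids are placeholders, and the brittleness (tasks are generated from a file written by a previous run) is exactly what the question is asking how to avoid.

```python
# Sketch: get_id_creds writes /tmp/ids.txt, and the DAG file reads it at
# parse time to fan out one upload task per id.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('upload_transactions', start_date=datetime(2019, 1, 1), schedule_interval=None)

def fetch_ids():
    # Placeholder for the real credential query; writes one id per line.
    with open('/tmp/ids.txt', 'w') as f:
        f.write('\n'.join(['user_1', 'user_2']))

def upload_transaction(user_id):
    print('uploading for %s' % user_id)

get_id_creds = PythonOperator(task_id='get_id_creds', python_callable=fetch_ids, dag=dag)

ids = []
if os.path.exists('/tmp/ids.txt'):
    with open('/tmp/ids.txt') as f:
        ids = [line.strip() for line in f if line.strip()]

for user_id in ids:
    upload = PythonOperator(
        task_id='upload_%s' % user_id,
        python_callable=upload_transaction,
        op_args=[user_id],
        dag=dag)
    get_id_creds >> upload
```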

Airflow - How to pass xcom variable into Python function

大兔子大兔子 submitted on 2019-11-27 20:13:26
Problem: I need to reference a variable that's returned by a BashOperator. I may be doing this wrong, so please forgive me. In my task_archive_s3_file, I need to get the filename from get_s3_file. The task simply prints {{ ti.xcom_pull(task_ids=submit_file_to_spark) }} as a string instead of the value. If I use the bash_command, the value prints correctly. get_s3_file = PythonOperator( task_id='get_s3_file', python_callable=obj.func_get_s3_file, trigger_rule=TriggerRule.ALL_SUCCESS, dag=dag) submit
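For context, the usual pattern inside a PythonOperator is to pull the XCom from the task instance supplied via provide_context rather than embedding a Jinja string (and when templating is used, the task id needs quotes: task_ids='...'). A minimal sketch, with an illustrative callable body:

```python
# Sketch: pull an upstream XCom value inside a PythonOperator callable.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('xcom_example', start_date=datetime(2019, 1, 1), schedule_interval=None)

def archive_s3_file(**context):
    # Pull the filename pushed/returned by the upstream task.
    filename = context['ti'].xcom_pull(task_ids='get_s3_file')
    print('archiving %s' % filename)

task_archive_s3_file = PythonOperator(
    task_id='task_archive_s3_file',
    python_callable=archive_s3_file,
    provide_context=True,
    dag=dag)
```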

Airflow tasks get stuck at “queued” status and never start running

六眼飞鱼酱① submitted on 2019-11-27 17:19:38
Problem: I'm using Airflow v1.8.1 and run all components (worker, web, flower, scheduler) on Kubernetes and Docker. I use the Celery Executor with Redis, and my tasks look like this: (start) fans out to (do_work_for_product1), (do_work_for_product2), (do_work_for_product3), and so on, so the start task has multiple downstreams. And I set up the concurrency-related configuration as below: parallelism = 3 dag_concurrency = 3 max_active_runs = 1 Then when I run this DAG manually (not sure if it never happens on a scheduled
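As a point of reference, with the CeleryExecutor the number of tasks that can actually run is bounded by parallelism and dag_concurrency in airflow.cfg, the worker's Celery concurrency, and the per-DAG settings; the latter can also be set on the DAG object, as sketched below with illustrative values (not a recommendation).

```python
# Sketch: per-DAG concurrency limits for a start task fanning out to several
# product tasks. Values mirror the question and are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    'fan_out_products',
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
    concurrency=3,       # at most 3 task instances of this DAG at once
    max_active_runs=1)   # only one DAG run in flight

start = DummyOperator(task_id='start', dag=dag)
for product in ['product1', 'product2', 'product3']:
    start >> DummyOperator(task_id='do_work_for_%s' % product, dag=dag)
```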

Airflow “This DAG isn't available in the webserver DagBag object”

本小妞迷上赌 submitted on 2019-11-27 13:52:00
When I put a new DAG Python script in the dags folder, I can view a new DAG entry in the DAG UI, but it is not enabled automatically. On top of that, it does not seem to be loaded properly either. I have to click the Refresh button a few times on the right side of the list and toggle the on/off button on the left side of the list before I can schedule the DAG. These are manual steps: I need to trigger something even though the DAG script was put inside the dags folder. Can anyone help me with this? Did I miss something? Or is this the correct behavior in Airflow? By the way, as mentioned
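One quick diagnostic (a sketch, with a hypothetical DAG id) is to build a DagBag by hand and check it for import errors, which is roughly what the webserver does when it populates its own DagBag:

```python
# Sketch: verify that the dags folder parses cleanly and the new DAG is found.
from airflow.models import DagBag

bag = DagBag()                     # parses files under the configured dags_folder
print(bag.import_errors)           # any syntax/import problems show up here
print('my_new_dag' in bag.dags)    # True once the DAG is picked up
```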

How to create a conditional task in Airflow

夙愿已清 submitted on 2019-11-27 11:27:23
I would like to create a conditional task in Airflow as described in the schema below. The expected scenario is the following: Task 1 executes. If Task 1 succeeds, then execute Task 2a; else, if Task 1 fails, then execute Task 2b. Finally, execute Task 3. All tasks above are SSHExecuteOperators. I'm guessing I should be using the ShortCircuitOperator and/or XCom to manage the condition, but I am not clear on how to implement that. Could you please describe the solution? Answer 1: You have to use Airflow trigger rules. All operators have a trigger_rule argument which defines the rule by which the generated task
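A minimal sketch of the trigger-rule approach the answer starts to describe, using DummyOperators in place of the SSHExecuteOperators from the question:

```python
# Sketch: task_2a fires only when task_1 succeeds, task_2b only when it fails,
# and task_3 runs once either branch has completed.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

dag = DAG('conditional_tasks', start_date=datetime(2019, 1, 1), schedule_interval=None)

task_1 = DummyOperator(task_id='task_1', dag=dag)
task_2a = DummyOperator(task_id='task_2a', trigger_rule=TriggerRule.ALL_SUCCESS, dag=dag)
task_2b = DummyOperator(task_id='task_2b', trigger_rule=TriggerRule.ONE_FAILED, dag=dag)
task_3 = DummyOperator(task_id='task_3', trigger_rule=TriggerRule.ONE_SUCCESS, dag=dag)

task_1 >> task_2a >> task_3
task_1 >> task_2b >> task_3
```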