airflow

Airflow latency between tasks

时光毁灭记忆、已成空白 posted on 2019-12-05 12:10:14
As you can see in the image, Airflow spends too much time between task executions; it accounts for almost 30% of the total DAG execution time. I've changed airflow.cfg to job_heartbeat_sec = 1 and scheduler_heartbeat_sec = 1, but the latency rate stays the same. Why does it behave this way? It is by design. For instance, I use Airflow to run large workflows where some tasks can take a really long time. Airflow is not meant for tasks that take seconds to execute; it can be used for that, of course, but it might not be the most suitable tool. With that said, there is not much that you...
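
A quick way to confirm whether those overrides actually took effect is to read the effective configuration from Python. This is a minimal sketch, assuming Airflow 1.10.x, where both keys live in the [scheduler] section and can also be overridden through AIRFLOW__SCHEDULER__* environment variables:

# Minimal sketch (assuming Airflow 1.10.x): print the heartbeat values Airflow
# actually resolved, including any environment-variable overrides.
from airflow.configuration import conf

print("job_heartbeat_sec:", conf.getint("scheduler", "job_heartbeat_sec"))
print("scheduler_heartbeat_sec:", conf.getint("scheduler", "scheduler_heartbeat_sec"))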

Generating dynamic tasks in airflow based on output of an upstream task

[亡魂溺海] posted on 2019-12-05 11:57:56
How do I generate tasks dynamically based on the list returned from an upstream task? I have tried the following: using an external file to write and read the list - this option works, but I am looking for a more elegant solution; and an XCom pull inside a subdag factory, which does not work. I am able to pass a list from the upstream task to a subdag, but that XCom is only accessible inside the subdag's tasks and cannot be used to loop over the returned list and generate tasks. For example, the subdag factory method: def subdag1(parent_dag_name, child_dag_name, default_args, **kwargs): dag_subdag = DAG...
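
One commonly suggested workaround, since XCom is not available while the scheduler parses the DAG file, is to have the upstream task persist the list somewhere the factory can read at parse time, for example an Airflow Variable. A hedged sketch, assuming Airflow 1.10.x; "dynamic_task_ids" is a hypothetical Variable name:

# Sketch only: the upstream task writes a JSON list to the Variable
# "dynamic_task_ids" (hypothetical name); the factory reads it at parse time.
from airflow.models import DAG, Variable
from airflow.operators.dummy_operator import DummyOperator


def subdag1(parent_dag_name, child_dag_name, default_args, **kwargs):
    dag_subdag = DAG(
        dag_id="{}.{}".format(parent_dag_name, child_dag_name),
        default_args=default_args,
        schedule_interval="@daily",  # typically matched to the parent DAG's schedule
    )
    # default_var avoids a parse-time failure before the upstream task has ever run.
    task_ids = Variable.get("dynamic_task_ids", default_var=[], deserialize_json=True)
    for task_id in task_ids:
        DummyOperator(task_id=task_id, dag=dag_subdag)
    return dag_subdag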

For Apache Airflow, How can I pass the parameters when manually trigger DAG via CLI?

扶醉桌前 posted on 2019-12-05 11:49:50
I use Airflow to manage ETL task execution and scheduling. A DAG has been created and it works fine, but is it possible to pass parameters when manually triggering the DAG via the CLI? For example: my DAG runs every day at 01:30 and processes data for yesterday (the time range from 01:30 yesterday to 01:30 today). If there are issues with the data source, I need to re-process that data with a manually specified time range. So can I create an Airflow DAG such that, when it is scheduled, the default time range is from 01:30 yesterday to 01:30 today, and then, if anything goes wrong with the data source, I need to...
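
For reference, a minimal sketch of the usual pattern, assuming Airflow 1.x: trigger with airflow trigger_dag -c '{"start": "...", "end": "..."}' my_etl_dag and read dag_run.conf inside the task, falling back to the scheduled window. The DAG id and the "start"/"end" keys are hypothetical:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator


def process(**context):
    # dag_run.conf carries the -c JSON payload; it is empty for scheduled runs.
    conf = context["dag_run"].conf or {}
    start = conf.get("start", context["execution_date"])
    end = conf.get("end", context["next_execution_date"])
    print("processing window:", start, end)


dag = DAG("my_etl_dag", start_date=datetime(2019, 12, 1), schedule_interval="30 1 * * *")

PythonOperator(task_id="process", python_callable=process, provide_context=True, dag=dag)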

How to really create n tasks in a SubDAG based on the result of a previous task

断了今生、忘了曾经 posted on 2019-12-05 11:35:43
I'm creating a dynamic DAG in Airflow using SubDAGs. What I need is for the number of tasks inside the SubDAG to be determined by the result of a previous task (the subtask_ids variable of the middle_section function should be the same variable as in the initial_task function). The problem is that I can't access XCom inside the subdag function of a SubDagOperator because I don't have any context there. I also can't read a value from any DB because of the scheduler's DAG autodiscovery feature: middle_section is executed every few seconds. How do you solve this? Create a...
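
One workaround that comes up for this (a sketch, assuming Airflow 1.10.x) is to give up on creating N separate task instances and instead run a single task inside the SubDAG that pulls the upstream list via XCom at run time, when a context is available; "parent_dag" and "initial_task" are hypothetical names:

from airflow.operators.python_operator import PythonOperator


def process_all(**context):
    # dag_id must be the parent DAG's id because XCom lookups default to the
    # current (sub)DAG; the subdag run shares the parent's execution_date.
    items = context["ti"].xcom_pull(dag_id="parent_dag", task_ids="initial_task") or []
    for item in items:
        print("processing", item)


# Inside the subdag factory: PythonOperator(task_id="middle_section",
#     python_callable=process_all, provide_context=True, dag=dag_subdag)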

Example DAG gets stuck in “running” state indefinitely

不问归期 posted on 2019-12-05 09:38:25
In my first foray into Airflow, I am trying to run one of the example DAGs that comes with the installation. This is v1.8.0. Here are my steps: $ airflow trigger_dag example_bash_operator [2017-04-19 15:32:38,391] {__init__.py:57} INFO - Using executor SequentialExecutor [2017-04-19 15:32:38,676] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags [2017-04-19 15:32:38,947] {cli.py:185} INFO - Created <DagRun example_bash_operator @ 2017-04-19 15:32...

airflow cleared tasks not getting executed

谁说胖子不能爱 posted on 2019-12-05 09:19:40
Preamble: yet another "airflow tasks not getting executed" question... Everything was going more or less fine in my Airflow experience up until this weekend, when things really went downhill. I have checked all the standard things, e.g. as outlined in this helpful post. I have reset the whole instance multiple times trying to get it working properly, but I am totally losing the battle here. Environment: airflow 1.10.2; OS: CentOS 7; Python: 3.6; virtualenv: yes; executor: LocalExecutor; backend DB: MySQL. The problem: here's what happens in my troubleshooting infinite loop / recurring...

Running an Airflow DAG every X minutes

我的未来我决定 posted on 2019-12-05 08:13:19
I am using Airflow on an EC2 instance with the LocalScheduler option. I've invoked airflow scheduler and airflow webserver and everything seems to be running fine. That said, after supplying the cron string '*/10 * * * *' to schedule_interval for "do this every 10 minutes", the job continues to execute every 24 hours by default. Here's the header of the code: from datetime import datetime import os import sys from airflow.models import DAG from airflow.operators.python_operator import PythonOperator import ds_dependencies SCRIPT_PATH = os.getenv('PREPROC_PATH') if SCRIPT_PATH: sys.path...
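
One frequent cause of this symptom is that schedule_interval is only set inside default_args, which the DAG constructor ignores, so the default daily interval applies. A minimal sketch, assuming Airflow 1.x; the DAG id is hypothetical:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator


def tick():
    print("running preprocessing step")


# schedule_interval must be passed to the DAG itself, not via default_args.
dag = DAG(
    dag_id="preproc_every_10_minutes",
    start_date=datetime(2019, 12, 1),
    schedule_interval="*/10 * * * *",  # cron string: every 10 minutes
)

PythonOperator(task_id="tick", python_callable=tick, dag=dag)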

Airflow File Sensor for sensing files on my local drive

淺唱寂寞╮ posted on 2019-12-05 07:43:01
Does anybody have any idea about FileSensor? I came across it while I was researching sensing files in my local directory. The code is as follows: task = FileSensor( task_id="senseFile", filepath="etc/hosts", fs_conn_id='fs_local', _hook=self.hook, dag=self.dag,) I have also set my conn_id with the conn type File (path) and gave it {'path': 'mypath'}, but even when I set a non-existing path, or the file isn't in the specified path, the task completes and the DAG is successful. The FileSensor doesn't seem to sense files at all. I found the community-contributed FileSensor a little bit...
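
For comparison, a corrected, hedged sketch, assuming Airflow 1.10.x where the sensor lives in airflow.contrib.sensors: the location that actually gets poked is the connection's {'path': ...} joined with filepath, which is a common reason the sensor appears to succeed against the wrong place:

from datetime import datetime

from airflow.models import DAG
from airflow.contrib.sensors.file_sensor import FileSensor

dag = DAG("file_sensor_example", start_date=datetime(2019, 12, 1), schedule_interval=None)

sense_file = FileSensor(
    task_id="senseFile",
    filepath="etc/hosts",   # joined onto the path stored in the fs_local connection
    fs_conn_id="fs_local",  # connection of type "File (path)" with {'path': 'mypath'}
    poke_interval=30,       # re-check every 30 seconds
    timeout=600,            # fail the task if the file never appears within 10 minutes
    dag=dag,
)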

Cannot access airflow web server via AWS load balancer HTTPS because airflow redirects me to HTTP

落花浮王杯 posted on 2019-12-05 07:40:54
I have an Airflow web server configured on EC2; it listens on port 8080. I have an AWS ALB (Application Load Balancer) in front of the EC2 instance, listening on HTTPS 80 (facing the internet), with the instance target port facing HTTP 8080. I cannot browse https://<airflow link> because the Airflow web server redirects me to http://<airflow link>/admin, which the ALB does not listen on. If I browse https://<airflow link>/admin/airflow/login?next=%2Fadmin%2F, then I see the login page because this link does not redirect me. My question is how to change Airflow so that when browsing...

Export environment variables at runtime with airflow

早过忘川 posted on 2019-12-05 07:26:03
I am currently converting workflows that were previously implemented as bash scripts into Airflow DAGs. In the bash scripts, I was simply exporting the variables at run time with export HADOOP_CONF_DIR="/etc/hadoop/conf". Now I'd like to do the same in Airflow, but I haven't found a solution for this yet. The one workaround I found was setting the variables with os.environ[VAR_NAME] = 'some_text' outside of any method or operator, but that means they get exported the moment the script is loaded, not at run time. Now, when I try to call os.environ[VAR_NAME] = 'some_text' in a function that gets called by a...
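
Two hedged options that avoid the parse-time export (a sketch, assuming Airflow 1.x): set the variable inside the Python callable so it happens when the task runs, or hand it to a BashOperator via its env argument; the DAG and task ids are hypothetical:

import os
from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG("env_at_runtime", start_date=datetime(2019, 12, 1), schedule_interval=None)


def run_with_env():
    # Runs in the worker process at task run time, not when the DAG file is parsed.
    os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"
    # ... call whatever needs HADOOP_CONF_DIR here ...


set_env_in_python = PythonOperator(
    task_id="set_env_in_python", python_callable=run_with_env, dag=dag
)

# env replaces the environment of the spawned bash process for this task only.
set_env_in_bash = BashOperator(
    task_id="set_env_in_bash",
    bash_command="echo $HADOOP_CONF_DIR",
    env={"HADOOP_CONF_DIR": "/etc/hadoop/conf"},
    dag=dag,
)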