airflow

Create dynamic pool in Airflow

耗尽温柔 submitted on 2019-12-13 00:24:55
Question: I have a DAG that creates a cluster, starts computation tasks, and, after they complete, tears down the cluster. I want to limit the concurrency of the computation tasks on this cluster to a fixed number, so logically I need a pool that is exclusive to the cluster created by a task. I don't want interference from other DAGs or from different runs of the same DAG. I thought I could solve this problem by creating a pool dynamically from a task after the cluster is created and deleting it once the …
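To make the idea concrete, here is a minimal sketch (not code from the post; the pool name, slot count, and task wiring are assumptions): a PythonOperator that runs after the cluster is created can insert a Pool row through Airflow's ORM session, and a mirror task can remove it during teardown.

    # Sketch only: create/delete a pool from inside a task so downstream tasks
    # can reference it via their pool="..." argument. Names and values are illustrative.
    from airflow import settings
    from airflow.models import Pool

    def create_cluster_pool(pool_name, slots, **context):
        """Insert a Pool row so later tasks in this run can use pool=pool_name."""
        session = settings.Session()
        session.add(Pool(pool=pool_name, slots=slots))
        session.commit()
        session.close()

    def delete_cluster_pool(pool_name, **context):
        """Remove the run-scoped pool during cluster teardown."""
        session = settings.Session()
        session.query(Pool).filter(Pool.pool == pool_name).delete()
        session.commit()
        session.close()

The compute tasks would then be declared with pool=pool_name, and the pool's slot count caps how many of them run at once.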

Airflow EC2-Instance socket.getfqdn() Bug

爷，独闯天下 submitted on 2019-12-13 00:16:52
Question: I'm using Airflow version 1.9, and there is a bug in the software that you can read about here on my previous Stack Overflow post, as well as here on another one of my Stack Overflow posts, and here on Airflow's GitHub where the bug is reported and discussed. Long story short, there are a few locations in Airflow's code where it needs to get the IP address of the server. It does this by calling socket.getfqdn(). The problem is that on Amazon EC2 instances (Amazon Linux 1) …
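For context, a quick interpreter check like the sketch below (purely illustrative) shows what the socket module reports on a given box, which is what Airflow ends up recording as the worker hostname; newer Airflow releases (1.10+) also expose a hostname_callable setting in airflow.cfg so a different resolver can be plugged in.

    # Compare the values Python's socket module reports on the instance.
    import socket

    print(socket.getfqdn())        # the call Airflow 1.9 uses internally
    print(socket.gethostname())    # the plain hostname
    print(socket.gethostbyname(socket.gethostname()))  # the resolved IP address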

Can Airflow be used to run a never-ending task?

♀尐吖头ヾ submitted on 2019-12-12 23:40:46
Question: Can we use an Airflow DAG to define a never-ending job (i.e. a task with an unconditional loop that consumes stream data) by setting the task/DAG timeout to None and manually triggering its run? Would having Airflow monitor a never-ending task cause a problem? Thanks. Answer 1: A bit odd to run this through Airflow, but yeah, I don't think that's an issue. Just note that if you restart the worker running the job (assuming CeleryExecutor), you'll interrupt the task and need to kick it off manually …
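For illustration only (the DAG id, task id, and loop body are hypothetical, not from the post), the setup being described would look roughly like this: a manually triggered DAG whose single task loops forever and has no execution_timeout.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def consume_stream():
        while True:
            pass  # read and process the next message from the stream here

    dag = DAG(
        dag_id='stream_consumer',
        schedule_interval=None,   # no schedule; trigger the run manually
        start_date=datetime(2019, 1, 1),
        catchup=False,
    )

    consume = PythonOperator(
        task_id='consume',
        python_callable=consume_stream,
        execution_timeout=None,   # never time the task out
        dag=dag,
    )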

Airflow PostgresHook returning an ID from an Insert statement not committing

纵然是瞬间 submitted on 2019-12-12 22:22:39
Question: I am using PostgresHook in an Airflow operator:

    pg_hook = PostgresHook(postgres_conn_id='postgres_default')
    insert_activities_sql = "INSERT INTO activities (---) VALUES (---) RETURNING id "
    activity_results = pg_hook.get_first(insert_activities_sql, parameters=insert_activities_params)

This does return the id, but the record is not committed to the activities table. I have tried running get_records and get_first and neither commits. .run commits but does not return the resulting id. Is this the …
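A commonly suggested workaround, shown here only as a sketch reusing the same connection id: take the hook's underlying connection yourself, so you can both fetch the RETURNING value and commit explicitly (get_first and get_records do not commit on their own).

    from airflow.hooks.postgres_hook import PostgresHook

    def insert_and_return_id(sql, parameters):
        """Run an INSERT ... RETURNING id, commit it, and return the new id."""
        hook = PostgresHook(postgres_conn_id='postgres_default')
        conn = hook.get_conn()
        try:
            with conn.cursor() as cur:
                cur.execute(sql, parameters)
                new_id = cur.fetchone()[0]   # value from RETURNING id
            conn.commit()                    # commit explicitly
            return new_id
        finally:
            conn.close()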

Airflow: Retry up to a specific time

扶醉桌前 submitted on 2019-12-12 19:23:44
Question: I need to create an Airflow job that absolutely must finish before 09:00. I currently have a job that starts at 07:00 with retries=8 and a 15-minute retry interval (8 × 15 min = 2 h). Unfortunately my job takes more time than that, and because of this the task fails after 09:00, which is the hard deadline. How can I make it retry every 15 minutes but fail if it is past 09:00, so that a human can take a look at the issue? Thanks for your help. Answer 1: You could use the execution_timeout argument when creating the task to control how long …
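As a sketch of the settings the answer points at (the operator, values, and command are illustrative): retries plus retry_delay keep the 15-minute cadence, while execution_timeout caps each individual attempt. Note that execution_timeout bounds a single try, not the total elapsed time across retries, so on its own it does not enforce an absolute 09:00 cut-off.

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id='deadline_example',
        schedule_interval='0 7 * * *',    # start at 07:00
        start_date=datetime(2019, 1, 1),
    )

    job = BashOperator(
        task_id='deadline_job',
        bash_command='python my_job.py',       # placeholder command
        retries=8,
        retry_delay=timedelta(minutes=15),     # retry every 15 minutes
        execution_timeout=timedelta(hours=2),  # kill any single attempt after 2 hours
        dag=dag,
    )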

Error using airflow's DataflowPythonOperator to schedule dataflow job

荒凉一梦 submitted on 2019-12-12 12:54:40
Question: I am trying to schedule Dataflow jobs using Airflow's DataflowPythonOperator. Here is my DAG operator:

    test = DataFlowPythonOperator(
        task_id='my_task',
        py_file='path/my_pyfile.py',
        gcp_conn_id='my_conn_id',
        dataflow_default_options={
            "project": 'my_project',
            "runner": "DataflowRunner",
            "job_name": 'my_job',
            "staging_location": 'gs://my/staging',
            "temp_location": 'gs://my/temping',
            "requirements_file": 'path/requirements.txt'
        }
    )

The gcp_conn_id has been set up and it works. And the …
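For reference (this line is not part of the excerpt): in Airflow 1.x the operator shown above lives in the GCP contrib package, so the DAG file would typically import it with:

    from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator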

Airflow: how to extend SubDagOperator?

筅森魡賤 submitted on 2019-12-12 11:19:47
Question: When I try to extend the SubDagOperator provided in the Airflow API, the Airflow webserver GUI does not recognize it as a SubDagOperator, which prevents me from zooming in to the sub-DAG. How can I extend SubDagOperator while preserving the ability to zoom in to it as a sub-DAG? Am I missing something? Answer 1: Please see the example below on how to extend the SubDagOperator. The key in your case is to override the task_type function:

    from airflow import DAG
    from airflow.operators.subdag_operator import …
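The answer's example is cut off above; a minimal sketch of the idea it describes (the class name is illustrative) is to subclass SubDagOperator and keep reporting the parent's task_type, since the web UI keys the "Zoom into Sub DAG" behaviour off that value.

    from airflow.operators.subdag_operator import SubDagOperator

    class MySubDagOperator(SubDagOperator):

        @property
        def task_type(self):
            # Report the parent class name so the UI still treats this as a SubDagOperator.
            return 'SubDagOperator'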

airflow TimeDeltaSensor fails with unsupported operand type

旧街凉风 submitted on 2019-12-12 10:57:45
Question: In my DAG I have a TimeDeltaSensor created with:

    from datetime import datetime, timedelta
    from airflow.operators.sensors import TimeDeltaSensor

    wait = TimeDeltaSensor(
        task_id='wait',
        delta=timedelta(seconds=300),
        dag=dag
    )

However, when it runs I get the error:

    Subtask: [2018-07-13 09:00:39,663] {models.py:1427} ERROR - unsupported operand type(s) for +=: 'NoneType' and 'datetime.timedelta'

The Airflow version is 1.8.1. The code is basically lifted from the Example Pipeline definition, so I'm nonplussed …
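A hedged note on the usual cause (not stated in the excerpt): TimeDeltaSensor adds its delta to dag.following_schedule(execution_date), which returns None when the DAG has no concrete schedule_interval (for example None or '@once'), producing exactly this NoneType += timedelta error. A sketch under that assumption, with a concrete schedule:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.sensors import TimeDeltaSensor

    dag = DAG(
        dag_id='wait_example',
        schedule_interval='0 8 * * *',   # a real schedule, so following_schedule() is not None
        start_date=datetime(2018, 7, 1),
    )

    wait = TimeDeltaSensor(
        task_id='wait',
        delta=timedelta(seconds=300),
        dag=dag,
    )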

How can you re-run an upstream task if a downstream task fails in Airflow (using sub-DAGs)?

馋奶兔 submitted on 2019-12-12 10:52:55
Question: I have an Airflow DAG that extracts data and then performs validation. If the validation fails, it needs to re-run the extract; if the validation succeeds, it continues. I've read people saying that sub-DAGs can solve this problem, but I can't find any example of it. I've tried using a sub-DAG, but I run into the same problem as when trying to do it in one DAG. How can I get all tasks in the sub-DAG to re-run if one of them fails? I have the following DAG/sub-DAG details (maindag.py):

    default_args = { …
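One pattern sometimes suggested for this, shown purely as a sketch (the task ids, callback wiring, and clearing logic are assumptions, not from the post): give the validation task retries and an on_retry_callback that clears the extract task instance for the same execution date, so the scheduler runs the extract again before the validation retries.

    from airflow.models import TaskInstance, clear_task_instances
    from airflow.utils.db import provide_session

    @provide_session
    def rerun_extract(context, session=None):
        """Clear the 'extract' task for this run so it is executed again."""
        dag = context['dag']
        execution_date = context['execution_date']
        tis = (
            session.query(TaskInstance)
            .filter(
                TaskInstance.dag_id == dag.dag_id,
                TaskInstance.task_id == 'extract',
                TaskInstance.execution_date == execution_date,
            )
            .all()
        )
        clear_task_instances(tis, session, dag=dag)

    # The validation task would then be declared with, for example:
    #   retries=3, on_retry_callback=rerun_extract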