airflow

How does the >> operator define task dependencies in Airflow?

Submitted by  ̄綄美尐妖づ on 2019-12-23 08:08:23
Problem: I was going through the Apache Airflow tutorial https://github.com/hgrif/airflow-tutorial and encountered this section for defining task dependencies.

with DAG('airflow_tutorial_v01',
         default_args=default_args,
         schedule_interval='0 * * * *',
         ) as dag:
    print_hello = BashOperator(task_id='print_hello', bash_command='echo "hello"')
    sleep = BashOperator(task_id='sleep', bash_command='sleep 5')
    print_world = PythonOperator(task_id='print_world', python_callable=print_world)

    print_hello >> sleep >>
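
In short, the bitshift operators are overloaded on Airflow operators to declare dependencies: a >> b means "b runs after a", and chains read left to right. A minimal, self-contained sketch of the same idea (the dag_id and task ids below are made up, not taken from the tutorial):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# `hello >> pause` is shorthand for hello.set_downstream(pause):
# pause only starts once hello has succeeded.
with DAG('rshift_example',                     # hypothetical dag_id
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    hello = BashOperator(task_id='hello', bash_command='echo "hello"')
    pause = BashOperator(task_id='pause', bash_command='sleep 5')
    world = BashOperator(task_id='world', bash_command='echo "world"')

    # The next two forms declare the same chain hello -> pause -> world:
    hello >> pause >> world
    # hello.set_downstream(pause); pause.set_downstream(world)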

Airflow - Splitting DAG definition across multiple files

Submitted by 删除回忆录丶 on 2019-12-23 07:05:35
Problem: Just getting started with Airflow and wondering what the best practices are for structuring large DAGs. For our ETL we have a lot of tasks that fall into logical groupings, yet the groups depend on each other. Which of the following would be considered best practice?

- One large DAG file with all tasks in that file
- Splitting the DAG definition across multiple files (how to do this?)
- Defining multiple DAGs, one for each group of tasks, and setting dependencies between them using
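
The excerpt cuts off before the accepted answer, but one common pattern for the second option is to put each logical group in its own module as a function that attaches tasks to a DAG passed in, then wire the groups together in a single DAG file. A minimal sketch under that assumption (module path, function and task ids are all made up):

# task_groups/ingest.py -- hypothetical helper module for one group of tasks
from airflow.operators.bash_operator import BashOperator

def build_ingest_tasks(dag):
    """Create this group's tasks on the given DAG and return its first and last task."""
    extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
    load = BashOperator(task_id='load', bash_command='echo load', dag=dag)
    extract >> load
    return extract, load

# etl_dag.py -- the single DAG file that stitches the groups together
from datetime import datetime
from airflow import DAG
from task_groups.ingest import build_ingest_tasks   # hypothetical import path

dag = DAG('etl', start_date=datetime(2019, 1, 1), schedule_interval='@daily')
ingest_start, ingest_end = build_ingest_tasks(dag)
# Downstream groups are wired the same way, e.g. ingest_end >> transform_start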

Airflow 1.10.0 via Ansible

Submitted by 一个人想着一个人 on 2019-12-23 04:09:58
Problem: Below is my Ansible code, which is trying to install Airflow 1.10.0. The output of sudo journalctl -u airflow-webserver -e is:

Dec 31 12:13:48 ip-10-136-94-232.eu-central-1.compute.internal airflow[22224]: ProgrammingError: (_mysql_exceptions.ProgrammingError) (1146, "Table 'airflow.log' doesn't exist") [SQL: u'INSERT INTO log (dttm, dag_id,

The output of sudo journalctl -u airflow-scheduler -e is:

Dec 31 12:14:19 ip-10-136-94-232.eu-central-1.compute.internal airflow[22307]: ProgrammingError: (_mysql

Airflow starts two DAG runs when turned on for the first time

Submitted by 烂漫一生 on 2019-12-23 04:04:53
Problem: When I boot up the Airflow webserver and scheduler for the first time on Oct 25th at around 17:23 and turn on my DAG, I can see that it kicks off two runs, for Oct 23rd and Oct 24th:

RUN 1 -> 10-23T17:23
RUN 2 -> 10-24T17:23

Here's my DAG configuration:

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': '2019-01-01',
    'retries': 0,
}

dag = DAG(
    'my_script',
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),
    catchup=False,
)

Since it's past the
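
The excerpt stops mid-sentence, but the dates themselves follow from how Airflow stamps runs: a run's execution_date marks the start of its schedule interval, so the run that actually fires on Oct 25 carries the Oct 24 stamp, and it only fires once Oct 24 plus one day has passed. A small sketch of the same configuration with that behaviour spelled out in comments (a sketch only, not the accepted answer):

import datetime
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime(2019, 1, 1),   # a datetime object rather than the string used in the question
    'retries': 0,
}

# With catchup=False the scheduler should only create the run for the most
# recently completed interval; that run is stamped with the interval's *start*,
# i.e. a run created on Oct 25 has execution_date Oct 24.
dag = DAG(
    'my_script',
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),
    catchup=False,
)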

How do I clear the state of a dag run with the CLI in airflow/composer?

Submitted by 别来无恙 on 2019-12-23 00:52:12
Problem: I thought I could use the command:

g beta composer environments run <env> --location=us-central1 clear -- <dag_id> -s 2018-05-13 -e 2018-05-14

to clear the state of the DAG runs on 2018-05-13. For some reason it doesn't work. What happens is that the CLI hangs on a message like: kubeconfig entry generated for <kube node name>. What is the expected behavior of the command above? I would expect it to clear the DAG run for the interval, but I might be doing something wrong.

Answer 1: Running clear

Suggestion for scheduling tool(s) for building Hadoop-based data pipelines

Submitted by 青春壹個敷衍的年華 on 2019-12-22 17:54:56
Problem: Between Apache Oozie, Spotify/Luigi and airbnb/airflow, what are the pros and cons of each? I have used Oozie and Airflow in the past for building a data ingestion pipeline using Pig and Hive. Currently, I am in the process of building a pipeline that looks at logs, extracts useful events and puts them on Redshift. I found that Airflow was much easier to use/test/set up. It has a much cooler UI and lets users perform actions from the UI itself, which is not the case with Oozie.

How do I trigger an Airflow DAG via the REST API?

Submitted by 帅比萌擦擦* on 2019-12-22 17:51:14
Problem: The 1.10.0 documentation says I should be able to make a POST against /api/experimental/dags/<DAG_ID>/dag_runs to trigger a DAG run, but instead when I do this, I receive an error:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>400 Bad Request</title>
<h1>Bad Request</h1>
<p>The browser (or proxy) sent a request that this server could not understand.</p>

Answer 1: To make this work, I figured out that I needed to send an empty JSON string in the body:

curl -X POST \
  http://airflow.dyn.fa
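
The curl command is cut off above, but the shape of the working request is clear: POST to /api/experimental/dags/<DAG_ID>/dag_runs with a JSON body, even an empty one. A minimal sketch of the same call from Python, assuming an Airflow 1.10 webserver reachable at http://localhost:8080 and a DAG id of example_dag (both placeholders):

import requests

# Hypothetical host and dag_id; adjust to your deployment.
url = "http://localhost:8080/api/experimental/dags/example_dag/dag_runs"

# An empty JSON object in the body is enough; a "conf" key can be added to
# pass parameters through to the triggered run.
resp = requests.post(url, json={})
resp.raise_for_status()
print(resp.json())   # e.g. a message confirming the created dag_run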

How to parse a JSON string in an Airflow template

Submitted by 时间秒杀一切 on 2019-12-22 16:41:47
Problem: Is it possible to parse a JSON string inside an Airflow template? I have an HttpSensor which monitors a job via a REST API, but the job id is in the response of the upstream task, which has xcom_push set to True. I would like to do something like the following; however, this code gives the error jinja2.exceptions.UndefinedError: 'json' is undefined

t1 = SimpleHttpOperator(
    http_conn_id="s1",
    task_id="job",
    endpoint="some_url",
    method='POST',
    data=json.dumps({"foo": "bar"}),
    xcom_push=True,
    dag
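
The excerpt ends before the answer, but the error itself only says that the name json is not defined inside Jinja. One way this is commonly handled is to expose the stdlib json module to the template context via the DAG's user_defined_macros argument; a minimal sketch under that assumption (the dag_id and the templated expression are made up):

import json
from datetime import datetime
from airflow import DAG

# Expose the json module to Jinja so templates can call json.loads(...)
# on the JSON string an upstream task pushed to XCom.
dag = DAG(
    dag_id='parse_json_example',           # hypothetical dag_id
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    user_defined_macros={'json': json},    # {{ json.loads(...) }} now resolves
)

# A templated field (e.g. an HttpSensor endpoint) could then contain:
# "jobs/{{ json.loads(ti.xcom_pull(task_ids='job'))['job_id'] }}/status"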

Running more than 32 concurrent tasks in Apache Airflow

Submitted by 牧云@^-^@ on 2019-12-22 13:48:17
Problem: I'm running Apache Airflow 1.8.1. I would like to run more than 32 concurrent tasks on my instance, but cannot get any of the configurations to work. I am using the CeleryExecutor, the Airflow config in the UI shows 64 for parallelism and dag_concurrency, and I've restarted the Airflow scheduler, webserver and workers numerous times (I'm actually testing this locally in a Vagrant machine, but have also tested it on an EC2 instance).

airflow.cfg:

# The amount of parallelism as a setting to the
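
Beyond the airflow.cfg values quoted above, two other caps are worth checking when nothing goes past 32: with the CeleryExecutor each worker also has its own concurrency setting in the [celery] section of airflow.cfg, and each DAG has a per-DAG limit. A sketch of the per-DAG knob only (the dag_id and value are placeholders, not a claimed fix for this case):

from datetime import datetime
from airflow import DAG

# concurrency caps how many task instances of *this* DAG run at once,
# independently of the global parallelism / dag_concurrency settings.
dag = DAG(
    dag_id='many_tasks_example',     # hypothetical dag_id
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    concurrency=64,
)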