airflow

How does the >> operator define task dependencies in Airflow?

Submitted by  ̄綄美尐妖づ on 2019-12-23 08:08:23
Problem: I was going through the Apache Airflow tutorial https://github.com/hgrif/airflow-tutorial and encountered this section for defining task dependencies.

with DAG('airflow_tutorial_v01',
         default_args=default_args,
         schedule_interval='0 * * * *',
         ) as dag:
    print_hello = BashOperator(task_id='print_hello', bash_command='echo "hello"')
    sleep = BashOperator(task_id='sleep', bash_command='sleep 5')
    print_world = PythonOperator(task_id='print_world', python_callable=print_world)

    print_hello >> sleep >>
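
In short, the bitshift operators are overloaded on Airflow operators to declare dependencies: a >> b means "b runs after a", and chains read left to right. A minimal, self-contained sketch of the same idea (the dag_id and task ids below are made up, not taken from the tutorial):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# `hello >> pause` is shorthand for hello.set_downstream(pause):
# pause only starts once hello has succeeded.
with DAG('rshift_example',                     # hypothetical dag_id
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    hello = BashOperator(task_id='hello', bash_command='echo "hello"')
    pause = BashOperator(task_id='pause', bash_command='sleep 5')
    world = BashOperator(task_id='world', bash_command='echo "world"')

    # The next two forms declare the same chain hello -> pause -> world:
    hello >> pause >> world
    # hello.set_downstream(pause); pause.set_downstream(world)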

Airflow - Splitting DAG definition across multiple files

Submitted by 删除回忆录丶 on 2019-12-23 07:05:35
Problem: Just getting started with Airflow and wondering what the best practices are for structuring large DAGs. For our ETL we have a lot of tasks that fall into logical groupings, yet the groups depend on each other. Which of the following would be considered best practice?

- One large DAG file with all tasks in that file
- Splitting the DAG definition across multiple files (how to do this?)
- Defining multiple DAGs, one for each group of tasks, and setting dependencies between them using
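
The excerpt cuts off before the accepted answer, but one common pattern for the second option is to put each logical group in its own module as a function that attaches tasks to a DAG passed in, then wire the groups together in a single DAG file. A minimal sketch under that assumption (module path, function and task ids are all made up):

# task_groups/ingest.py -- hypothetical helper module for one group of tasks
from airflow.operators.bash_operator import BashOperator

def build_ingest_tasks(dag):
    """Create this group's tasks on the given DAG and return its first and last task."""
    extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
    load = BashOperator(task_id='load', bash_command='echo load', dag=dag)
    extract >> load
    return extract, load

# etl_dag.py -- the single DAG file that stitches the groups together
from datetime import datetime
from airflow import DAG
from task_groups.ingest import build_ingest_tasks   # hypothetical import path

dag = DAG('etl', start_date=datetime(2019, 1, 1), schedule_interval='@daily')
ingest_start, ingest_end = build_ingest_tasks(dag)
# Downstream groups are wired the same way, e.g. ingest_end >> transform_start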

Airflow 1.10.0 via Ansible

Submitted by 一个人想着一个人 on 2019-12-23 04:09:58
Problem: Below is my Ansible code, which is trying to install Airflow 1.10.0. The output of sudo journalctl -u airflow-webserver -e is:

Dec 31 12:13:48 ip-10-136-94-232.eu-central-1.compute.internal airflow[22224]: ProgrammingError: (_mysql_exceptions.ProgrammingError) (1146, "Table 'airflow.log' doesn't exist") [SQL: u'INSERT INTO log (dttm, dag_id,

The output of sudo journalctl -u airflow-scheduler -e is:

Dec 31 12:14:19 ip-10-136-94-232.eu-central-1.compute.internal airflow[22307]: ProgrammingError: (_mysql

Airflow starts two DAG runs when turned on for the first time

Submitted by 烂漫一生 on 2019-12-23 04:04:53
Problem: When I boot up the Airflow webserver and scheduler for the first time on Oct 25th at around 17:23 and turn on my DAG, I can see that it kicks off two runs, for Oct 23rd and Oct 24th:

RUN 1 -> 10-23T17:23
RUN 2 -> 10-24T17:23

Here's my DAG configuration:

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': '2019-01-01',
    'retries': 0,
}

dag = DAG(
    'my_script',
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),
    catchup=False,
)

Since it's past the
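
The excerpt stops mid-sentence, but the dates themselves follow from how Airflow stamps runs: a run's execution_date marks the start of its schedule interval, so the run that actually fires on Oct 25 carries the Oct 24 stamp, and it only fires once Oct 24 plus one day has passed. A small sketch of the same configuration with that behaviour spelled out in comments (a sketch only, not the accepted answer):

import datetime
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime(2019, 1, 1),   # a datetime object rather than the string used in the question
    'retries': 0,
}

# With catchup=False the scheduler should only create the run for the most
# recently completed interval; that run is stamped with the interval's *start*,
# i.e. a run created on Oct 25 has execution_date Oct 24.
dag = DAG(
    'my_script',
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),
    catchup=False,
)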

How do I clear the state of a dag run with the CLI in airflow/composer?

Submitted by 别来无恙 on 2019-12-23 00:52:12
Problem: I thought I could use the command:

g beta composer environments run <env> --location=us-central1 clear -- <dag_id> -s 2018-05-13 -e 2018-05-14

to clear the state of the DAG runs on 2018-05-13. For some reason it doesn't work. What happens is that the CLI hangs on a message like: kubeconfig entry generated for <kube node name>. What is the expected behavior of the command above? I would expect it to clear the DAG run for the interval, but I might be doing something wrong.

Answer 1: Running clear

Suggestion for scheduling tool(s) for building Hadoop-based data pipelines

Submitted by 青春壹個敷衍的年華 on 2019-12-22 17:54:56
Problem: Between Apache Oozie, Spotify/Luigi and airbnb/airflow, what are the pros and cons of each? I have used Oozie and Airflow in the past for building a data ingestion pipeline using Pig and Hive. Currently, I am in the process of building a pipeline that looks at logs, extracts useful events and puts them on Redshift. I found that Airflow was much easier to use/test/set up. It has a much cooler UI and lets users perform actions from the UI itself, which is not the case with Oozie.

How do I trigger an Airflow DAG via the REST API?

Submitted by 帅比萌擦擦* on 2019-12-22 17:51:14
Problem: The 1.10.0 documentation says I should be able to make a POST against /api/experimental/dags/<DAG_ID>/dag_runs to trigger a DAG run, but instead when I do this, I receive an error:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>400 Bad Request</title>
<h1>Bad Request</h1>
<p>The browser (or proxy) sent a request that this server could not understand.</p>

Answer 1: To make this work, I figured out that I needed to send an empty JSON string in the body:

curl -X POST \
  http://airflow.dyn.fa
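
The curl command is cut off above, but the shape of the working request is clear: POST to /api/experimental/dags/<DAG_ID>/dag_runs with a JSON body, even an empty one. A minimal sketch of the same call from Python, assuming an Airflow 1.10 webserver reachable at http://localhost:8080 and a DAG id of example_dag (both placeholders):

import requests

# Hypothetical host and dag_id; adjust to your deployment.
url = "http://localhost:8080/api/experimental/dags/example_dag/dag_runs"

# An empty JSON object in the body is enough; a "conf" key can be added to
# pass parameters through to the triggered run.
resp = requests.post(url, json={})
resp.raise_for_status()
print(resp.json())   # e.g. a message confirming the created dag_run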

How to parse a JSON string in an Airflow template

Submitted by 时间秒杀一切 on 2019-12-22 16:41:47
Problem: Is it possible to parse a JSON string inside an Airflow template? I have an HttpSensor which monitors a job via a REST API, but the job id is in the response of the upstream task, which has xcom_push set to True. I would like to do something like the following; however, this code gives the error jinja2.exceptions.UndefinedError: 'json' is undefined

t1 = SimpleHttpOperator(
    http_conn_id="s1",
    task_id="job",
    endpoint="some_url",
    method='POST',
    data=json.dumps({"foo": "bar"}),
    xcom_push=True,
    dag
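
The excerpt ends before the answer, but the error itself only says that the name json is not defined inside Jinja. One way this is commonly handled is to expose the stdlib json module to the template context via the DAG's user_defined_macros argument; a minimal sketch under that assumption (the dag_id and the templated expression are made up):

import json
from datetime import datetime
from airflow import DAG

# Expose the json module to Jinja so templates can call json.loads(...)
# on the JSON string an upstream task pushed to XCom.
dag = DAG(
    dag_id='parse_json_example',           # hypothetical dag_id
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    user_defined_macros={'json': json},    # {{ json.loads(...) }} now resolves
)

# A templated field (e.g. an HttpSensor endpoint) could then contain:
# "jobs/{{ json.loads(ti.xcom_pull(task_ids='job'))['job_id'] }}/status"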

Running more than 32 concurrent tasks in Apache Airflow

Submitted by 牧云@^-^@ on 2019-12-22 13:48:17
Problem: I'm running Apache Airflow 1.8.1. I would like to run more than 32 concurrent tasks on my instance, but cannot get any of the configurations to work. I am using the CeleryExecutor, the Airflow config in the UI shows 64 for parallelism and dag_concurrency, and I've restarted the Airflow scheduler, webserver and workers numerous times (I'm actually testing this locally in a Vagrant machine, but have also tested it on an EC2 instance).

airflow.cfg:

# The amount of parallelism as a setting to the
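
Beyond the airflow.cfg values quoted above, two other caps are worth checking when nothing goes past 32: with the CeleryExecutor each worker also has its own concurrency setting in the [celery] section of airflow.cfg, and each DAG has a per-DAG limit. A sketch of the per-DAG knob only (the dag_id and value are placeholders, not a claimed fix for this case):

from datetime import datetime
from airflow import DAG

# concurrency caps how many task instances of *this* DAG run at once,
# independently of the global parallelism / dag_concurrency settings.
dag = DAG(
    dag_id='many_tasks_example',     # hypothetical dag_id
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    concurrency=64,
)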