apache-airflow

How to obtain and process MySQL records using Airflow?

感情迁移 submitted on 2019-12-04 00:34:53
I need to 1. run a SELECT query on a MySQL DB and fetch the records, and 2. process the records with a Python script. I am unsure about the way I should proceed. Is XCom the way to go here? Also, MySqlOperator only executes the query; it doesn't fetch the records. Is there any built-in transfer operator I can use? How can I use a MySQL hook here? You may want to use a PythonOperator that uses the hook to get the data, apply the transformation, and ship the (now scored) rows back to some other place. Can someone explain how to proceed regarding the same? Refer - http://markmail.org/message/x6nfeo6zhjfeakfe def do
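A minimal sketch of the hook-plus-PythonOperator approach mentioned above, assuming a connection id, table, and transformation that are purely illustrative (import paths follow the Airflow 1.x layout used throughout this page):

    from datetime import datetime
    from airflow import DAG
    from airflow.hooks.mysql_hook import MySqlHook
    from airflow.operators.python_operator import PythonOperator

    def fetch_and_process(**kwargs):
        # Fetch the rows with the hook, then transform them in plain Python.
        hook = MySqlHook(mysql_conn_id='my_mysql')                      # hypothetical connection id
        records = hook.get_records('SELECT id, score FROM my_table')    # hypothetical query
        processed = [(row[0], row[1] * 2) for row in records]           # placeholder transformation
        # Write the result wherever it needs to go here; only return a small
        # summary, since the return value is pushed to XCom.
        return len(processed)

    dag = DAG('mysql_process_example', start_date=datetime(2019, 1, 1), schedule_interval=None)

    fetch_and_process_task = PythonOperator(
        task_id='fetch_and_process',
        python_callable=fetch_and_process,
        provide_context=True,
        dag=dag)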

Issues running airflow scheduler as a daemon process

眉间皱痕 submitted on 2019-12-03 16:32:56
Question: I have an EC2 instance that is running Airflow 1.8.0 using the LocalExecutor. Per the docs, I would have expected one of the following two commands to start the scheduler in daemon mode: airflow scheduler --daemon --num_runs=20 or airflow scheduler --daemon=True --num_runs=5 But that isn't the case. The first command seems like it's going to work, but it just returns the following output before returning to the terminal without producing any background task: [2017-09-28 18:15:02,794] {_

Airflow : Passing a dynamic value to Sub DAG operator

断了今生、忘了曾经 submitted on 2019-12-03 15:00:00
I am new to Airflow. I have come across a scenario where the parent DAG needs to pass some dynamic number (let's say n) to a SubDAG, and the SubDAG will use this number to dynamically create n parallel tasks. The Airflow documentation doesn't cover a way to achieve this, so I have explored a couple of ways. Option 1 (using XCom pull): I tried to pass the value as an XCom, but for some reason the SubDAG does not resolve to the passed value. Parent DAG file: def load_dag(**kwargs): number_of_runs = json.dumps(kwargs['dag_run'].conf['number_of_runs']) dag_data = json.dumps({ "number_of_runs": number_of_runs })
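The usual workaround (a sketch only, not taken from the excerpt above) is to read the count from somewhere the scheduler can see on every parse, such as an Airflow Variable, since the SubDAG's shape has to be fixed at parse time; the Variable name and task ids below are hypothetical:

    from datetime import datetime
    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.subdag_operator import SubDagOperator

    default_args = {'owner': 'airflow', 'start_date': datetime(2019, 1, 1)}

    def build_subdag(parent_dag_id, child_dag_id, args):
        subdag = DAG('%s.%s' % (parent_dag_id, child_dag_id),
                     default_args=args, schedule_interval=None)
        # Read at parse time: an XCom only exists at run time, after the structure is fixed.
        n = int(Variable.get('number_of_runs', default_var=1))    # hypothetical Variable
        for i in range(n):
            DummyOperator(task_id='run_%d' % i, dag=subdag)
        return subdag

    parent = DAG('parent_dag', default_args=default_args, schedule_interval=None)

    child = SubDagOperator(
        task_id='child_dag',
        subdag=build_subdag('parent_dag', 'child_dag', default_args),
        dag=parent)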

Debugging Broken DAGs

感情迁移 submitted on 2019-12-03 06:09:59
When the Airflow webserver shows errors like Broken DAG: [<path/to/dag>] <error>, how and where can we find the full stack trace for these exceptions? I tried these locations: /var/log/airflow/webserver -- had no logs in the timeframe of execution; the other logs were binary, and decoding with strings gave no useful information. /var/log/airflow/scheduler -- had some logs, but they were in binary form; I tried to read them and they looked to be mostly sqlalchemy logs, probably for Airflow's database. /var/log/airflow/worker -- shows the logs for running DAGs (same as the ones you see on the airflow
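One common way to surface the full traceback (a sketch under the assumption that you can run Python 3 on the machine that parses the DAGs; the path is a placeholder) is to import the DAG file exactly as the parser would:

    import importlib.util

    # Importing the file the same way Airflow's DAG parser does re-raises the
    # exception behind the "Broken DAG" banner, with a full traceback on the console.
    path = '/path/to/dags/my_broken_dag.py'                    # hypothetical path
    spec = importlib.util.spec_from_file_location('my_broken_dag', path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)                            # any import/syntax error surfaces here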

Airflow: pass {{ ds }} as param to PostgresOperator

江枫思渺然 submitted on 2019-12-03 05:51:43
I would like to use the execution date as a parameter to my SQL file. I tried dt = '{{ ds }}' s3_to_redshift = PostgresOperator( task_id='s3_to_redshift', postgres_conn_id='redshift', sql='s3_to_redshift.sql', params={'file': dt}, dag=dag ) but it doesn't work. dt = '{{ ds }}' doesn't work because Jinja (the templating engine used within Airflow) does not process the entire DAG definition file. For each operator there are fields which Jinja will process, and they are part of the definition of the operator itself. In this case, you can make the params field (which is actually called parameters , make
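Since sql is one of the templated fields on PostgresOperator, a simple sketch of an alternative (the file contents and table are placeholders, and the surrounding DAG is only there to make the snippet self-contained) is to reference the macro inside the SQL itself:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.postgres_operator import PostgresOperator

    dag = DAG('s3_to_redshift_example', start_date=datetime(2019, 1, 1), schedule_interval='@daily')

    # s3_to_redshift.sql (rendered by Jinja because `sql` is a templated field), e.g.:
    #   COPY my_table FROM 's3://my-bucket/{{ ds }}/data.csv' ...   -- hypothetical statement

    s3_to_redshift = PostgresOperator(
        task_id='s3_to_redshift',
        postgres_conn_id='redshift',
        sql='s3_to_redshift.sql',   # {{ ds }} inside the file is rendered per execution date
        dag=dag)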

In Airflow, is there a good way to call another DAG's task?

喜夏-厌秋 submitted on 2019-12-01 22:24:00
I've got dag_prime and dag_tertiary. dag_prime: scans through a directory and intends to call dag_tertiary on each item; currently a PythonOperator. dag_tertiary: scans through the directory passed to it and does (possibly time-intensive) calculations on its contents. I can call the secondary one via a system call from the Python operator, but I feel like there's got to be a better way. I'd also like to consider queuing the dag_tertiary calls, if there's a simple way to do that. Is there a better way than using system calls? Thanks! Use TriggerDagRunOperator for calling one DAG
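A sketch of the TriggerDagRunOperator approach (Airflow 1.x signature; the payload key and directory are illustrative), where the triggered DAG reads the value from kwargs['dag_run'].conf:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dagrun_operator import TriggerDagRunOperator

    def trigger(context, dag_run_obj):
        # Attach a payload; dag_tertiary reads it via kwargs['dag_run'].conf['directory'].
        dag_run_obj.payload = {'directory': '/data/incoming/some_dir'}   # hypothetical value
        return dag_run_obj   # returning None instead would skip the trigger

    dag_prime = DAG('dag_prime', start_date=datetime(2019, 1, 1), schedule_interval=None)

    trigger_tertiary = TriggerDagRunOperator(
        task_id='trigger_dag_tertiary',
        trigger_dag_id='dag_tertiary',
        python_callable=trigger,
        dag=dag_prime)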

Airflow authentication setup fails with “AttributeError: can't set attribute”

你说的曾经没有我的故事 submitted on 2019-12-01 15:55:00
The Airflow 1.8 password authentication setup as described in the docs fails at the step user.password = 'set_the_password' with the error AttributeError: can't set attribute. It's better to simply use PasswordUser's _set_password attribute instead: # Instead of user.password = 'password' user._set_password = 'password' This is due to an update of SQLAlchemy to a version >= 1.2 that introduced a backwards-incompatible change. You can fix this by explicitly installing a SQLAlchemy version < 1.2: pip install 'sqlalchemy<1.2' Or in a requirements.txt: sqlalchemy<1.2 Fixed with pip install
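Put together, a sketch of the user-creation snippet from the docs with the workaround applied (username, email, and password are placeholders):

    from airflow import models, settings
    from airflow.contrib.auth.backends.password_auth import PasswordUser

    user = PasswordUser(models.User())
    user.username = 'new_user'                 # hypothetical credentials
    user.email = 'new_user@example.com'
    user._set_password = 'set_the_password'    # workaround for SQLAlchemy >= 1.2 (instead of user.password = ...)
    session = settings.Session()
    session.add(user)
    session.commit()
    session.close()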

Tasks added to DAG during runtime fail to be scheduled

非 Y 不嫁゛ submitted on 2019-12-01 10:37:51
My idea is to have a task foo which generates a list of inputs (users, reports, log files, etc), and a task is launched for every element in the input list. The goal is to make use of Airflow's retrying and other logic, instead of reimplementing it. So, ideally, my DAG should look something like this: The only variable here is the number of tasks generated. I want to do some more tasks after all of these are completed, so spinning up a new DAG for every task does not seem appropriate. This is my code: default_args = { 'owner': 'airflow', 'depends_on_past': False, 'start_date': datetime(2015, 6
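For context, tasks that only exist at run time are never seen by the scheduler, so the usual pattern (a sketch, with a hypothetical Variable standing in for whatever foo produces) is to generate the fan-out at parse time:

    import json
    from datetime import datetime
    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG('dynamic_fanout', start_date=datetime(2019, 1, 1), schedule_interval=None)

    foo = DummyOperator(task_id='foo', dag=dag)        # would normally refresh the input list
    join = DummyOperator(task_id='join', dag=dag)      # downstream work after all elements finish

    # The list must be resolvable every time the file is parsed (Variable, file, etc.);
    # an XCom produced while the DAG is already running comes too late to add tasks.
    inputs = json.loads(Variable.get('foo_inputs', default_var='["a", "b"]'))   # hypothetical Variable

    for name in inputs:
        work = DummyOperator(task_id='process_%s' % name, dag=dag)
        foo >> work >> join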

Airflow: How to push an XCom value from PostgresOperator?

六眼飞鱼酱① submitted on 2019-12-01 06:35:47
I'm using Airflow 1.8.1 and I want to push the result of a SQL request from PostgresOperator. Here are my tasks: check_task = PostgresOperator( task_id='check_task', postgres_conn_id='conx', sql="check_task.sql", xcom_push=True, dag=dag) def py_is_first_execution(**kwargs): value = kwargs['ti'].xcom_pull(task_ids='check_task') print 'count ----> ', value if value == 0: return 'next_task' else: return 'end-flow' check_branch = BranchPythonOperator( task_id='is-first-execution', python_callable=py_is_first_execution, provide_context=True, dag=dag) and here is my SQL script: select count(1) from
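PostgresOperator.execute does not return the query result, so nothing useful lands in XCom; a common alternative (a sketch, with a hypothetical table name and a minimal DAG to keep it self-contained) is a PythonOperator that queries through PostgresHook and returns the count:

    from datetime import datetime
    from airflow import DAG
    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('check_example', start_date=datetime(2019, 1, 1), schedule_interval=None)

    def check_task_callable(**kwargs):
        hook = PostgresHook(postgres_conn_id='conx')
        # get_first returns the first row as a tuple, e.g. (0,)
        count = hook.get_first('select count(1) from my_table')[0]   # hypothetical table
        return count   # the return value is pushed to XCom automatically

    check_task = PythonOperator(
        task_id='check_task',
        python_callable=check_task_callable,
        provide_context=True,
        dag=dag)

    # Downstream, kwargs['ti'].xcom_pull(task_ids='check_task') now yields the count itself.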