airflow

Issues running airflow scheduler as a daemon process

Submitted by 眉间皱痕 on 2019-12-03 16:32:56
Question: I have an EC2 instance that is running Airflow 1.8.0 using the LocalExecutor. Per the docs, I would have expected one of the following two commands to start the scheduler in daemon mode:

    airflow scheduler --daemon --num_runs=20

or

    airflow scheduler --daemon=True --num_runs=5

But that isn't the case. The first command seems like it's going to work, but it just returns the following output before returning to the terminal without producing any background task: [2017-09-28 18:15:02,794] {_
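
I have not found a definitive fix for the --daemon flag itself; a common workaround is to skip daemon mode and supervise a foreground scheduler yourself (systemd, supervisord, nohup, or a small wrapper script). A minimal Python sketch of such a wrapper follows; the --num_runs value of 20 is taken from the question, and the restart loop is just one way to keep a bounded scheduler process running:

    import subprocess

    # Run the scheduler in the foreground and restart it whenever a bounded
    # run finishes (--num_runs limits each scheduler process to 20 loops).
    while True:
        subprocess.call(["airflow", "scheduler", "--num_runs", "20"])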

What is the difference between min_file_process_interval and dag_dir_list_interval in Apache Airflow 1.9.0?

Submitted by 别说谁变了你拦得住时间么 on 2019-12-03 16:05:19
We are using Airflow v1.9.0. We have 100+ DAGs and the instance is really slow; the scheduler is only launching some of the tasks. In order to reduce CPU usage, we want to tweak some configuration parameters, namely min_file_process_interval and dag_dir_list_interval. The documentation is not really clear about the difference between the two:

    min_file_process_interval: In cases where there are only a small number of DAG definition files, the loop could potentially process the DAG definition files many times a minute. To control the rate of DAG file processing, the min_file_process
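
As I understand the 1.9 scheduler (a hedged summary, not an authoritative definition): dag_dir_list_interval controls how often the dags folder is re-listed to discover new or deleted DAG files, while min_file_process_interval sets the minimum number of seconds between two consecutive parses of the same DAG file; raising either reduces CPU spent on parsing. Both live in the [scheduler] section of airflow.cfg and can be inspected from Python:

    from airflow.configuration import conf

    # Minimum seconds between re-parsing the same DAG definition file.
    print(conf.getint("scheduler", "min_file_process_interval"))
    # Seconds between re-listing the dags folder to pick up new/removed files.
    print(conf.getint("scheduler", "dag_dir_list_interval"))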

Install airflow package extras in PyCharm

Submitted by 懵懂的女人 on 2019-12-03 15:41:42
I want to use the Airflow package extras s3 and postgres in PyCharm but do not know how to install them (on macOS Sierra). My attempts so far: Airflow itself can be installed from Preferences > Project > Project Interpreter > +, but not the extras, as far as I can work out. The extras can be installed with pip in the terminal using $ pip install airflow[s3,postgres], but they end up in a different interpreter (~/anaconda) than the one used by PyCharm (/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7). Checking the Python executables in my /usr/local/bin directory I found
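
The usual cause here is that the terminal's pip belongs to a different interpreter than the one PyCharm is configured with. One way to make the extras land in the right place is to find PyCharm's interpreter from inside PyCharm and then call that interpreter's pip explicitly; a minimal sketch (the install command is then run in a shell):

    import sys

    # Run this in PyCharm's Python console; it prints the interpreter PyCharm uses.
    # Then, in a terminal, install the extras with that exact interpreter, e.g.:
    #   <printed path> -m pip install "airflow[s3,postgres]"
    print(sys.executable)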

airflow TriggerDagRunOperator how to change the execution date

Submitted by 喜夏-厌秋 on 2019-12-03 15:32:48
I noticed that for a scheduled task the execution date is set in the past, according to the docs:

    Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.

However, when a DAG triggers another DAG, the execution time is set to now(). Is there a way to give the triggered DAG the same execution date as the triggering DAG? Of course, I can rewrite the template and use yesterday_ds; however, this is a tricky
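
In the 1.8/1.9 TriggerDagRunOperator the triggered run's execution date is indeed set at trigger time, but you can at least forward the parent's execution_date through the payload, which shows up as dag_run.conf in the triggered DAG (newer 1.10.x releases also accept an execution_date argument on the operator). A minimal sketch, with a hypothetical child DAG id and assuming an existing dag object:

    from airflow.operators.dagrun_operator import TriggerDagRunOperator

    def forward_execution_date(context, dag_run_obj):
        # The payload becomes dag_run.conf in the triggered DAG, so the child
        # can read the parent's execution_date instead of relying on its own.
        dag_run_obj.payload = {
            "parent_execution_date": context["execution_date"].isoformat()
        }
        return dag_run_obj

    trigger = TriggerDagRunOperator(
        task_id="trigger_child",
        trigger_dag_id="child_dag",
        python_callable=forward_execution_date,
        dag=dag,
    )

In the child DAG the value is then available as {{ dag_run.conf["parent_execution_date"] }} in templated fields.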

Using Dataflow vs. Cloud Composer

Submitted by 假装没事ソ on 2019-12-03 15:03:53
Question: I apologize for this naive question, but I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job; it wasn't clear to me from the Google documentation. Currently, I'm using Cloud Dataflow to read a non-standard CSV file, do some basic processing, and load it into BigQuery. Let me give a very basic example:

    # file.csv
    type\x01date
    house\x0112/27/1982
    car\x0111/9/1889

From this file we detect the schema and create a BigQuery table, something
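
As a side note, the \x01-delimited sample above is straightforward to read in plain Python by overriding the csv delimiter, which is one reason the question is really about orchestration versus data processing rather than about parsing. A small sketch (file name taken from the example):

    import csv

    # The first row is the header from which a BigQuery schema could be inferred.
    with open("file.csv", newline="") as f:
        reader = csv.reader(f, delimiter="\x01")
        header = next(reader)                      # ['type', 'date']
        rows = [dict(zip(header, r)) for r in reader]
    print(header, rows)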

Airflow : Passing a dynamic value to Sub DAG operator

Submitted by 断了今生、忘了曾经 on 2019-12-03 15:00:00
I am new to Airflow. I have come across a scenario where a parent DAG needs to pass some dynamic number (let's say n) to a SubDAG, and the SubDAG will use this number to dynamically create n parallel tasks. The Airflow documentation doesn't cover a way to achieve this, so I have explored a couple of ways.

Option 1 (using XCom pull): I have tried to pass the number as an XCom value, but for some reason the SubDAG is not resolving to the passed value. Parent DAG file:

    def load_dag(**kwargs):
        number_of_runs = json.dumps(kwargs['dag_run'].conf['number_of_runs'])
        dag_data = json.dumps({"number_of_runs": number_of_runs})
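
One reason XCom tends not to work here is that a SubDAG's structure is decided when the file is parsed, while XCom values only exist at run time. A common workaround is to keep the number in an Airflow Variable that the SubDAG factory reads at parse time. A minimal sketch; the Variable name "number_of_runs" and the factory signature are illustrative:

    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.dummy_operator import DummyOperator

    def subdag_factory(parent_dag_id, child_dag_id, default_args):
        # Read the task count at parse time; default_args is assumed to
        # contain a start_date, as in the parent DAG.
        n = int(Variable.get("number_of_runs", default_var=1))
        subdag = DAG(dag_id="{}.{}".format(parent_dag_id, child_dag_id),
                     default_args=default_args)
        for i in range(n):
            DummyOperator(task_id="task_{}".format(i), dag=subdag)
        return subdag

The parent DAG then wraps the returned DAG in a SubDagOperator, and whatever sets the Variable (UI, CLI, or another task) controls how many parallel tasks appear on the next parse.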

Apache Airflow - trigger/schedule DAG rerun on completion (File Sensor)

Submitted by 风流意气都作罢 on 2019-12-03 14:42:42
Good morning. I'm trying to set up a DAG to:

    1. Watch/sense for a file to hit a network folder
    2. Process the file
    3. Archive the file

Using the tutorials online and Stack Overflow, I have been able to come up with the following DAG and operator that successfully achieve the objectives; however, I would like the DAG to be rescheduled or rerun on completion so it starts watching/sensing for another file. I attempted to set max_active_runs=1 and then schedule_interval=timedelta(seconds=5); this does reschedule the DAG, but it starts queuing tasks and locks the file. Any ideas welcome on how I could
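
One pattern that is often suggested for this is to leave the DAG unscheduled (schedule_interval=None) and have its last task re-trigger the same DAG, so a new run only starts once the previous file has been archived. A minimal sketch, assuming Airflow 1.10.x where FileSensor lives in contrib; all ids, paths, and commands are illustrative:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.sensors.file_sensor import FileSensor
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.dagrun_operator import TriggerDagRunOperator

    dag = DAG("watch_process_archive",
              start_date=datetime(2019, 1, 1),
              schedule_interval=None,   # only runs when triggered
              max_active_runs=1)

    wait = FileSensor(task_id="wait_for_file",
                      filepath="/mnt/share/incoming/data.csv",
                      poke_interval=30,
                      dag=dag)

    process = BashOperator(task_id="process",
                           bash_command="python /opt/jobs/process.py",
                           dag=dag)

    archive = BashOperator(task_id="archive",
                           bash_command="mv /mnt/share/incoming/data.csv /mnt/share/archive/",
                           dag=dag)

    # Re-trigger this same DAG once the file has been archived.
    restart = TriggerDagRunOperator(task_id="restart",
                                    trigger_dag_id="watch_process_archive",
                                    dag=dag)

    wait >> process >> archive >> restart

The very first run has to be triggered manually (or via the CLI); after that, each run queues the next one itself.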

Airflow inside docker running a docker container

Submitted by 女生的网名这么多〃 on 2019-12-03 13:46:29
Question: I have Airflow running on an EC2 instance, and I am scheduling some tasks that spin up a Docker container. How do I do that? Do I need to install Docker on my Airflow container? And what is the next step after that? I have a yaml file that I am using to spin up the container, and it is derived from the puckel/airflow Docker image.

Answer 1: Finally resolved. My EC2 setup is running Ubuntu Xenial 16.04 and uses a modified puckel/airflow Docker image that is running airflow. Things you will need to
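
A hedged sketch of the approach commonly used for this setup (not necessarily what the truncated answer goes on to describe): mount the host's Docker socket into the Airflow container in the yaml/compose file (/var/run/docker.sock:/var/run/docker.sock), install the docker Python package in the Airflow image, and use the DockerOperator pointed at that socket. The image name and command below are illustrative, and an existing dag object is assumed:

    from airflow.operators.docker_operator import DockerOperator

    run_job = DockerOperator(
        task_id="run_job",
        image="my-org/my-job:latest",
        command="python /app/run.py",
        docker_url="unix://var/run/docker.sock",  # the socket mounted from the host
        network_mode="bridge",
        dag=dag,
    )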

How to integrate Apache Airflow with slack?

Submitted by 荒凉一梦 on 2019-12-03 13:13:26
Question: Could someone please give me a step-by-step guide on how to connect Apache Airflow to a Slack workspace? I created a webhook for my channel; what should I do with it next? Kind regards.

Answer 1: Create a Slack token from https://api.slack.com/custom-integrations/legacy-tokens and use the SlackAPIPostOperator in your DAG as below:

    SlackAPIPostOperator(
        task_id='failure',
        token='YOUR_TOKEN',
        text=text_message,
        channel=SLACK_CHANNEL,
        username=SLACK_USER)

The above is the simplest way you can use
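
Since the question mentions an incoming webhook rather than a legacy token, a hedged alternative (available in Airflow 1.10.x contrib) is SlackWebhookOperator; the connection id, token path, and channel below are placeholders, and an existing dag object is assumed:

    from airflow.contrib.operators.slack_webhook_operator import SlackWebhookOperator

    notify = SlackWebhookOperator(
        task_id="slack_notify",
        http_conn_id="slack_webhook",      # HTTP connection with host https://hooks.slack.com/services
        webhook_token="/T000/B000/XXXX",   # the path portion of your webhook URL
        message="DAG finished",
        channel="#alerts",
        dag=dag,
    )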

setting up airflow with bigquery operator

Submitted by 旧巷老猫 on 2019-12-03 10:48:19
Question: I am experimenting with Airflow for data pipelines. Unfortunately, I cannot get it to work with the BigQuery operator so far. I have searched for a solution to the best of my ability, but I am still stuck. I am using the SequentialExecutor running locally. Here is my code:

    from airflow import DAG
    from airflow.contrib.operators.bigquery_operator import BigQueryOperator
    from datetime import datetime, timedelta

    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': datetime
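
For reference, a minimal BigQueryOperator task built on the imports above might look like the following sketch; the query, destination table, and connection id are placeholders, and a working Google Cloud connection (bigquery_default by default) with valid credentials is assumed. Older 1.x versions take the query via bql=, later ones via sql=:

    # Assumes a DAG object named dag defined with the default_args above.
    bq_task = BigQueryOperator(
        task_id="bq_query",
        bql="SELECT 1",
        destination_dataset_table="my_project.my_dataset.my_table",
        write_disposition="WRITE_TRUNCATE",
        bigquery_conn_id="bigquery_default",
        use_legacy_sql=False,
        dag=dag,
    )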