airflow

“Invalid arguments passed” error for DAG that loads MySQL data to BigQuery using Airflow

Submitted by 孤人 on 2019-12-11 02:53:37
Question: I am running a DAG that extracts MySQL data and loads it to BigQuery in Airflow. I am currently getting the following error: /usr/local/lib/python2.7/dist-packages/airflow/models.py:1927: PendingDeprecationWarning: Invalid arguments were passed to MySqlToGoogleCloudStorageOperator. Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were: *args: () **kwargs: {'google_cloud_storage_connn_id': 'podioGCPConnection'} category=PendingDeprecationWarning /usr/local/lib
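
The keyword rejected in the warning, 'google_cloud_storage_connn_id', is spelled with three "n"s, while the operator's parameter is google_cloud_storage_conn_id, which is the likely reason it is reported as invalid. A minimal sketch of the operator call with the expected spelling, keeping the connection ID from the question and using placeholders for everything else:

# Sketch only: apart from 'podioGCPConnection', the connection IDs, SQL,
# bucket and filename below are placeholders, not taken from the question.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator

dag = DAG(dag_id='mysql_to_bigquery',
          start_date=datetime(2019, 1, 1),
          schedule_interval='@daily')

extract = MySqlToGoogleCloudStorageOperator(
    task_id='extract_mysql_to_gcs',
    mysql_conn_id='my_mysql_conn',                      # placeholder
    google_cloud_storage_conn_id='podioGCPConnection',  # note: two "n"s in conn
    sql='SELECT * FROM my_table',                       # placeholder
    bucket='my-staging-bucket',                         # placeholder
    filename='my_table/export_{}.json',
    dag=dag,
)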

apache-airflow systemd file using a conda env

Submitted by 你。 on 2019-12-11 02:06:50
Question: I'm trying to run apache-airflow on Ubuntu 16.04, using systemd. I roughly followed this tutorial and installed/set up the following: Miniconda 2, 64-bit; gcc (sudo apt-get install gcc); a conda environment, using the yml file from the tutorial. Within that conda environment: export AIRFLOW_HOME="/home/ubuntu/airflow" When I test Airflow, everything works fine: airflow webserver --port 8080 But whenever I try to launch Airflow using a systemd file, it fails. The systemd
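
The question is cut off before the unit file itself, so this is only a rough sketch of what a webserver unit running out of a conda environment might look like; the Miniconda path, environment name and user are assumptions, not taken from the question:

# Hypothetical /etc/systemd/system/airflow-webserver.service; adjust the
# miniconda path, env name and user to match the actual installation.
[Unit]
Description=Airflow webserver daemon
After=network.target

[Service]
User=ubuntu
Group=ubuntu
Environment="AIRFLOW_HOME=/home/ubuntu/airflow"
Environment="PATH=/home/ubuntu/miniconda2/envs/airflow/bin:/usr/local/bin:/usr/bin:/bin"
ExecStart=/home/ubuntu/miniconda2/envs/airflow/bin/airflow webserver --port 8080
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

The point of the sketch is that systemd cannot run "conda activate", so ExecStart points straight at the environment's own airflow binary and the environment's bin directory is put on PATH instead.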

Dynamically create list of tasks

Submitted by 孤者浪人 on 2019-12-11 01:14:16
Question: I have a DAG which is created by querying DynamoDB for a list, and for each item in the list a task is created using a PythonOperator and added to the DAG. Not shown in the example below, but it's important to note that some of the items in the list depend upon other tasks, so I'm using set_upstream to enforce the dependencies. - airflow_home \- dags \- workflow.py workflow.py def get_task_list(): # ... query dynamodb ... def run_task(task): # ... do stuff ... dag = DAG(dag_id='my_dag', ...)
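
A rough sketch of the pattern being described, with a hard-coded list standing in for the DynamoDB query and invented item names; the real get_task_list() and run_task() bodies are not shown in the question:

# Illustrative only: the list below stands in for the DynamoDB query, and
# the item names / depends_on keys are made up for the example.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def get_task_list():
    # ... would query DynamoDB here ...
    return [
        {'name': 'load_customers', 'depends_on': []},
        {'name': 'load_orders', 'depends_on': ['load_customers']},
    ]


def run_task(task):
    # ... do stuff ...
    print('running %s' % task['name'])


dag = DAG(dag_id='my_dag',
          start_date=datetime(2019, 1, 1),
          schedule_interval='@daily')

# One operator per item, kept in a dict so dependencies can be wired by name.
items = get_task_list()
operators = {
    item['name']: PythonOperator(task_id=item['name'],
                                 python_callable=run_task,
                                 op_kwargs={'task': item},
                                 dag=dag)
    for item in items
}

for item in items:
    for upstream in item['depends_on']:
        operators[item['name']].set_upstream(operators[upstream])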

Understanding the tree view in Apache Airflow

Submitted by 你离开我真会死。 on 2019-12-11 01:06:00
Question: I set up the DAG from https://airflow.apache.org/tutorial.html as is, the only change being that I set the DAG to run at an interval of 5 minutes with a start date of 2017-12-17T13:40:00 UTC. I enabled the DAG before 13:40, so there was no backfill, and my machine is running on UTC. The DAG ran as expected (i.e. at an interval of 5 minutes, starting at 13:45 UTC). Now, when I go to the tree view, I am failing to understand the graph. There are 3 tasks in total. 'sleep' (t2) has upstream
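
For reference, a cut-down sketch of the tutorial DAG with the 5-minute interval described (only two of its three tasks shown); the comment spells out the interval convention that makes the first run appear at 13:45 rather than 13:40:

# Sketch, not the full tutorial DAG. Airflow triggers a DAG run at the *end*
# of its schedule interval: the run stamped execution_date=13:40 covers
# 13:40-13:45 and only starts at 13:45 UTC, so the run labels shown in the
# tree view lag the wall-clock start times by one interval.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(dag_id='tutorial',
          start_date=datetime(2017, 12, 17, 13, 40),
          schedule_interval='*/5 * * * *')

t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
t2 = BashOperator(task_id='sleep', bash_command='sleep 5', dag=dag)
t2.set_upstream(t1)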

Setting up S3 logging in Airflow

Submitted by 大兔子大兔子 on 2019-12-11 00:07:47
Question: This is driving me nuts. I'm setting up Airflow in a cloud environment. I have one server running the scheduler and the webserver and one server as a Celery worker, and I'm using Airflow 1.8.0. Running jobs works fine. What refuses to work is logging. I've set up the correct path in airflow.cfg on both servers: remote_base_log_folder = s3://my-bucket/airflow_logs/ remote_log_conn_id = s3_logging_conn I've set up s3_logging_conn in the Airflow UI, with the access key and the secret key as
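
For context, the Airflow 1.8-style settings quoted above normally live in the [core] section of airflow.cfg on every machine that writes or serves logs (scheduler, webserver and each Celery worker); a sketch, with the section placement and the encrypt_s3_logs line added from memory and worth verifying against the installed version:

# airflow.cfg sketch (Airflow 1.8): the two values are copied from the
# question; [core] placement and encrypt_s3_logs are stated from memory.
[core]
remote_base_log_folder = s3://my-bucket/airflow_logs/
remote_log_conn_id = s3_logging_conn
encrypt_s3_logs = False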

Airflow force re-run of upstream task when cleared, even though downstream is marked success

Submitted by China☆狼群 on 2019-12-10 23:50:40
Question: I have tasks A -> B -> C in Airflow, and when I run the DAG and all complete with success, I'd like to be able to clear B alone (while leaving C marked as success). B clears and gets put into the 'no_status' state, but then when I try to re-run B, nothing happens. I've tried --ignore_dependencies, --ignore_depends_on_past and --force, but to no avail. B seems to only re-run if C is also cleared, and then everything re-runs as expected. The reason why I'd like to be able to re-run B specifically
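
For reference, a hedged sketch of the command-line sequence the question appears to be describing; the DAG ID, task ID and dates are placeholders, and the run flags are the ones named above:

# Placeholders throughout: my_dag, B and the dates are not from the question.
# Clear only task B for one run (regex anchored so A and C stay untouched):
airflow clear my_dag -t '^B$' -s 2019-12-10 -e 2019-12-10

# Then attempt to run that single task instance again with the flags mentioned:
airflow run my_dag B 2019-12-10T00:00:00 --force --ignore_dependencies --local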

Airflow cron expression is not scheduling DAG properly

Submitted by 淺唱寂寞╮ on 2019-12-10 20:54:16
Question: I am exploring Airflow to be used as a cron replacement so that I can use its other features while setting up the cron jobs. I was testing its functionality by setting a cron expression like "2,3,5,8, * * * *". I was expecting the particular DAG to be scheduled on minutes 2, 3, 5 and 8 of every hour. However, in reality the run for the 2nd minute is executed on the 3rd, the one for the 3rd on the 5th, and the one for the 5th on the 8th, and it is not executed for the 8th at all. I guess it would be executed for the 8th on the 2nd minute of the next hour. Looks like some kind of
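
The offsets described are consistent with Airflow triggering a DAG run at the end of its schedule interval rather than at the beginning; a sketch (with the trailing comma dropped from the cron expression for the example, and a hypothetical DAG id and start date):

# Sketch illustrating the scheduling convention. With this cron, the run
# stamped execution_date=HH:02 covers the interval HH:02 -> HH:03 and is
# triggered at HH:03; the HH:08 run covers HH:08 -> (next hour) HH:02 and
# therefore only fires at the 2nd minute of the following hour.
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(dag_id='cron_example',
          start_date=datetime(2019, 12, 1),
          schedule_interval='2,3,5,8 * * * *')

DummyOperator(task_id='noop', dag=dag)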

Pass other arguments to on_failure_callback

Submitted by 半腔热情 on 2019-12-10 18:58:24
Question: I'd like to pass other arguments to my on_failure_callback function, but it only seems to want "context". How do I pass other arguments to that function, especially since I'd like to define that function in a separate module so it can be used in all my DAGs? My current default_args looks like this: default_args = { 'owner': 'Me', 'depends_on_past': True, 'start_date': datetime(2016,01,01), 'email': ['me@me.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1, 'retry_delay':
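
One common way to do this (sketched here with invented names, not taken from the question) is to keep the shared callback in its own module and pre-bind the extra arguments with functools.partial, so Airflow still calls the result with just the context dict:

# Hypothetical shared module, e.g. my_callbacks.py: notify_failure() and the
# "channel"/"severity" arguments are made-up names used for illustration.
from functools import partial


def notify_failure(context, channel, severity='error'):
    task_instance = context['task_instance']
    print('[%s/%s] %s.%s failed' % (channel, severity,
                                    task_instance.dag_id, task_instance.task_id))


# In the DAG file: partial() pre-binds the extra arguments; Airflow still
# invokes the callback with the single context argument.
default_args = {
    'owner': 'Me',
    'on_failure_callback': partial(notify_failure, channel='#data-alerts'),
}

A closure, or a small factory function that returns the configured callback, works the same way if partial feels too opaque.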

How do I set an environment variable for airflow to use?

Submitted by 白昼怎懂夜的黑 on 2019-12-10 17:05:49
Question: Airflow is returning an error when trying to run a DAG, saying that it can't find an environment variable, which is odd because it's able to find 3 other environment variables that I'm storing as a Python variable. There are no issues with those variables at all. I have all 4 variables in ~/.profile and have also done export var1="varirable1" export var2="varirable2" export var3="varirable3" export var4="varirable4" Under what user does Airflow run? I've done those export commands under sudo as well, so
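
Whatever the root cause here turns out to be, the variable has to be present in the environment of the process that actually executes the task (the scheduler or worker daemon, under whatever user it runs as), not just in the interactive shell where ~/.profile was sourced. A small defensive sketch, reusing var4 from the question, that makes the missing-variable case obvious in the task log:

# Sketch: read the variable defensively so the task log shows clearly when
# the scheduler/worker process was started without it in its environment.
import os


def get_required_env(name):
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(
            '%s is not set in the environment of the process running this task '
            '(the scheduler/worker daemon, not your interactive shell)' % name)
    return value


var4 = get_required_env('var4')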

How can I restart the airflow server on Google Composer?

Submitted by 我是研究僧i on 2019-12-10 15:55:00
Question: When I need to restart the webserver locally I do: ps -ef | grep airflow | awk '{print $2}' | xargs kill -9 followed by airflow webserver -p 8080 -D How can I do this on Google Composer? I don't see an option to restart the server in the console. Answer 1: Since Cloud Composer is a managed Apache Airflow service, it is not possible to restart the whole service. You can, though, restart the single instances of the service, as described here, but this will not help to apply the plugin changes. To apply the
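
The quoted answer is cut off at this point. One workaround that is often suggested for nudging the Composer webserver into restarting (an addition here, not part of the quoted answer, and the exact gcloud flags should be double-checked against current documentation) is to push a trivial environment update, which makes Composer redeploy its Airflow components:

# Assumption/workaround, not from the quoted answer; environment name,
# location and the throwaway variable are placeholders, flags from memory.
gcloud composer environments update my-composer-env \
    --location us-central1 \
    --update-env-variables=DUMMY_RESTART=$(date +%s)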