airflow

ECS Airflow 1.10.2 performance issues. Operators and tasks take 10x longer

北慕城南 submitted on 2019-12-06 08:25:34
Question: We moved to puckel/Airflow 1.10.2 to try to resolve the poor performance we've seen in multiple environments. We are running Airflow 1.10.2 on AWS ECS. Interestingly, CPU/memory never jumps above 80%, and the Airflow metadata DB stays very underutilized as well. Below I've listed the configuration we're using, the DagBag parsing time, and the detailed execution times from the cProfile output of simply running DagBag() in pure Python. A few of our DAGs import a function from create_subdag_functions
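
A quick way to reproduce the parse-time numbers referred to above is to profile DagBag() directly. This is a minimal sketch (assuming Airflow is installed and AIRFLOW_HOME points at the DAG folder in question), not code from the original post:

    import cProfile
    from airflow.models import DagBag

    # Parses every DAG file and reports where the time is spent,
    # sorted by cumulative time.
    cProfile.run("DagBag()", sort="cumulative")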

Cloud Composer (Airflow) jobs stuck

允我心安 submitted on 2019-12-06 08:03:19
My Cloud Composer managed Airflow has been stuck for hours since I canceled a Task Instance that was taking too long (let's call it Task A). I've cleared all the DAG Runs and task instances, but there are a few jobs still running and one job in the Shutdown state (I suppose the job of Task A) (snapshot of my Jobs). Besides, it seems that the scheduler is not running, since recently deleted DAGs keep appearing in the dashboard. Is there a way to kill the jobs or reset the scheduler? Any idea for un-sticking the Composer will be welcome. You can restart the scheduler as follows: From your cloud shell: 1
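
The answer text is cut off above; the usual shape of that procedure (an assumption here, with placeholder cluster, zone, and pod names) is to point kubectl at the Composer environment's GKE cluster and delete the scheduler pod so Kubernetes recreates it:

    # Authenticate kubectl against the Composer environment's GKE cluster
    gcloud container clusters get-credentials <composer-gke-cluster> --zone <zone>

    # Find the scheduler pod (namespace names vary per environment)
    kubectl get pods --all-namespaces | grep airflow-scheduler

    # Deleting it forces Kubernetes to start a fresh scheduler pod
    kubectl delete pod <airflow-scheduler-pod-name> -n <namespace>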

Airflow + Cluster + Celery + SQS - Airflow Worker: 'Hub' object has no attribute '_current_http_client'

旧时模样 submitted on 2019-12-06 07:11:31
I'm trying to cluster my Airflow setup and I'm using this article to do so. I just configured my airflow.cfg file to use the CeleryExecutor, pointed my sql_alchemy_conn to my PostgreSQL database that's running on the same master node, set the broker_url to use AWS SQS (I didn't set the access_key_id or secret_key; since it's running on an EC2 instance it doesn't need those), and set the celery_result_backend to my PostgreSQL server too. I saved my new airflow.cfg changes, ran airflow initdb, and then ran airflow scheduler, which worked. I went to the UI and turned on one of my
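
For reference, the configuration described above would look roughly like this in airflow.cfg; the host names and credentials are placeholders, and the result-backend key is spelled celery_result_backend in older 1.x configs and result_backend in newer ones:

    [core]
    executor = CeleryExecutor
    sql_alchemy_conn = postgresql+psycopg2://airflow:password@master-node:5432/airflow

    [celery]
    # An empty sqs:// URL lets Celery pick up credentials from the EC2 instance profile
    broker_url = sqs://
    celery_result_backend = db+postgresql://airflow:password@master-node:5432/airflow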

Cannot modify mapred.job.name at runtime. It is not in list of params that are allowed to be modified at runtime

你说的曾经没有我的故事 submitted on 2019-12-06 06:29:22
I am trying to run a Hive job in Airflow. I made a custom JDBC connection, which you can see in the image. I can query Hive tables through the Airflow web UI (Data Profiling -> Ad Hoc Query). I also want to run a sample DAG file from the Internet:

    # File Name: wf_incremental_load.py
    from airflow import DAG
    from airflow.operators import BashOperator, HiveOperator
    from datetime import datetime, timedelta

    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2019, 3, 13),
        'retries': 1,
        'retry_delay': timedelta(minutes=5)
    }

    dag = DAG('hive_test', default_args=default_args, schedule_interval='* */5
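
The error in the title ("Cannot modify mapred.job.name at runtime") comes from Hive's SQL-standard authorization rejecting the properties Airflow's Hive CLI hook sets on each query. A commonly cited fix, offered here as an assumption rather than something stated in the excerpt, is to whitelist those properties in hive-site.xml:

    <property>
      <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
      <!-- allow the job name and the airflow.ctx.* variables Airflow injects -->
      <value>mapred\.job\.name|airflow\.ctx\..*</value>
    </property>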

How to properly handle Daylight Savings Time in Apache Airflow?

拥有回忆 submitted on 2019-12-06 04:40:54
Question: In Airflow, everything is supposed to be in UTC (which is not affected by DST). However, we have workflows that deliver things based on time zones that are affected by DST. An example scenario: we have a job scheduled with a start date at 8:00 AM Eastern and a schedule interval of 24 hours. Every day at 8 AM Eastern the scheduler sees that it has been 24 hours since the last run, and runs the job. DST happens and we lose an hour. Today at 8 AM Eastern the scheduler sees that it has only been 23
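
The excerpt cuts off before any answer. Since Airflow 1.10 supports timezone-aware DAGs, one common starting point (an assumption here, following the pendulum pattern in Airflow's own time-zone documentation) is to declare the start date in the affected zone rather than in UTC, so the scheduler knows which local time the 8 AM requirement refers to:

    import pendulum
    from datetime import datetime
    from airflow import DAG

    # An IANA zone that observes DST
    local_tz = pendulum.timezone("America/New_York")

    default_args = {
        'owner': 'airflow',
        # timezone-aware start date: 8:00 AM Eastern
        'start_date': datetime(2019, 3, 1, 8, 0, tzinfo=local_tz),
    }

    # A cron expression ("run at 08:00") instead of timedelta(hours=24),
    # so the schedule is expressed as a wall-clock time rather than a fixed delta.
    dag = DAG('eastern_8am_job', default_args=default_args,
              schedule_interval='0 8 * * *')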

How to pass dynamic arguments to an Airflow operator?

橙三吉。 submitted on 2019-12-06 02:23:40
I am using Airflow to run Spark jobs on Google Cloud Composer. I need to create a cluster (YAML parameters supplied by the user) and run a list of Spark jobs (job params also supplied by per-job YAML). With the Airflow API I can read YAML files and push variables across tasks using XCom. But consider DataprocClusterCreateOperator(): cluster_name, project_id, zone, and a few other arguments are marked as templated. What if I want to pass in other arguments as templated (which currently are not), like image_version, num_workers, worker_machine_type, etc.? Is there any workaround for this? Not sure what
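
One workaround (an assumption here; the excerpt cuts off before any answer) is to subclass the operator and widen its template_fields, since Jinja rendering is applied to whatever attribute names that tuple lists. The extra field names below are taken from the question and assumed to match same-named attributes on the operator:

    from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

    class TemplatedDataprocClusterCreateOperator(DataprocClusterCreateOperator):
        # Keep the operator's own templated fields and append the ones we
        # also want rendered from Jinja / XCom values.
        template_fields = tuple(DataprocClusterCreateOperator.template_fields) + (
            'image_version', 'num_workers', 'worker_machine_type',
        )

Used in place of the stock operator, {{ ... }} expressions passed to image_version, num_workers, or worker_machine_type would then be rendered before the cluster is created.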

How to run airflow scheduler as a daemon process?

我的梦境 submitted on 2019-12-06 02:08:12
Question: I am new to Airflow. I am trying to run airflow scheduler as a daemon process, but the process does not live for long. I have configured "LocalExecutor" in the airflow.cfg file and ran the following command to start the scheduler (I am using Google Compute Engine and accessing the server via PuTTY):

    airflow scheduler --daemon --num_runs=5 --log-file=/root/airflow/logs/scheduler.log

When I run this command, the airflow scheduler starts and I can see the airflow-scheduler.pid file in my airflow home
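
Note that --num_runs=5 by itself makes the scheduler exit after five scheduler runs, which would already explain a short-lived process. If the --daemon flag proves unreliable, a simple alternative (a sketch under assumptions about paths, not something stated in the excerpt) is to background the process and keep its output:

    nohup airflow scheduler >> /root/airflow/logs/scheduler.log 2>&1 &

A more robust option is a systemd unit with Restart=always, so the scheduler is restarted automatically if it exits.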

Airflow k8s operator xcom - Handshake status 403 Forbidden

老子叫甜甜 submitted on 2019-12-06 01:37:28
When I run a Docker image using KubernetesPodOperator in Airflow version 1.10, once the pod finishes the task successfully, Airflow tries to get the XCom value by making a connection to the pod via the k8s stream client. Following is the error I encountered:

    [2018-12-18 05:29:02,209] {{models.py:1760}} ERROR - (0) Reason: Handshake status 403 Forbidden
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/site-packages/kubernetes/stream/ws_client.py", line 249, in websocket_call
        client = WSClient(configuration, get_websocket_url(url), headers)
      File "/usr/local/lib/python3.6/site
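
The websocket handshake in that traceback is the exec call Airflow uses to read the XCom sidecar, so a 403 usually points at the service account lacking pods/exec permissions (an assumption about the root cause, not stated in the excerpt). If the return value is not actually needed, a minimal workaround is to skip the XCom read entirely:

    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    task = KubernetesPodOperator(
        task_id='run_image',            # hypothetical task id
        name='run-image',
        namespace='default',
        image='my-image:latest',        # hypothetical image
        # In 1.10.x the flag is xcom_push; later releases rename it do_xcom_push.
        # False skips the sidecar exec call that triggers the 403.
        xcom_push=False,
        dag=dag,                        # assumes an existing DAG object
    )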

Schedule a DAG in airflow to run every 5 minutes

六眼飞鱼酱① submitted on 2019-12-06 01:30:09
Question: I have a DAG in Airflow and for now it is running every hour (@hourly). Is it possible to have it run every 5 minutes?

Answer 1: Yes, here's an example of a DAG that I have running every 5 min:

    dag = DAG(dag_id='eth_rates', default_args=args, schedule_interval='*/5 * * * *', dagrun_timeout=timedelta(seconds=5))

schedule_interval accepts a CRON expression: https://en.wikipedia.org/wiki/Cron#CRON_expression

Answer 2: The documentation states: Each DAG may or may not have a schedule, which informs how
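
For completeness, a self-contained version of Answer 1's snippet might look like the following; the args dictionary and start date are assumptions, since only the DAG() line appears in the answer:

    from datetime import datetime, timedelta
    from airflow import DAG

    args = {
        'owner': 'airflow',
        'start_date': datetime(2019, 1, 1),   # hypothetical start date
    }

    # '*/5 * * * *' is a standard cron expression: every 5 minutes.
    dag = DAG(dag_id='eth_rates',
              default_args=args,
              schedule_interval='*/5 * * * *',
              dagrun_timeout=timedelta(seconds=5))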

Airflow stops following Spark job submitted over SSH

自作多情 submitted on 2019-12-06 01:27:47
I'm using Apache Airflow standalone to submit my Spark jobs, using SSHExecutorOperator to connect to the edge node and submit the jobs with a simple BashCommand. It mostly works well, but sometimes random tasks run indefinitely. My job succeeds, but it is still running according to Airflow. When I check the logs, it's as if Airflow has stopped following the job and never got the return value. Why could this happen? Some jobs run for 10h+ and Airflow watches them successfully, while others fail. I have only Spark's logs (at INFO level), without anything printed by the job driver. It
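
For context, the setup described above roughly corresponds to a task like the one below, written here with the contrib SSHOperator since the connection id, command, and exact operator wiring are not shown in the excerpt:

    from airflow.contrib.operators.ssh_operator import SSHOperator

    submit_spark_job = SSHOperator(
        task_id='submit_spark_job',
        ssh_conn_id='edge_node_ssh',   # hypothetical connection to the edge node
        # A plain spark-submit; in client deploy mode the command only returns
        # once the driver finishes, which is what lets Airflow track the job.
        command='spark-submit --master yarn --deploy-mode client /path/to/job.py',
        dag=dag,                        # assumes an existing DAG object
    )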