airflow

Airflow scheduler is slow to schedule subsequent tasks

余生颓废 submitted on 2019-11-30 12:49:13
Question: When I try to run a DAG in Airflow 1.8.0, I find that a lot of time passes between the completion of a predecessor task and the moment the successor task is picked up for execution (usually longer than the execution times of the individual tasks). The scenario is the same for the Sequential, Local and Celery executors. Is there a way to reduce this overhead (for example, any parameters in airflow.cfg that can speed up DAG execution)? A Gantt chart has been added for reference: Answer 1: As
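As a rough illustration only (not the truncated answer above), these are the airflow.cfg settings most commonly tuned for scheduling latency in Airflow 1.8; the values shown are placeholders, not recommendations:

    [core]
    parallelism = 32            # total task slots across the installation
    dag_concurrency = 16        # concurrent task instances per DAG

    [scheduler]
    job_heartbeat_sec = 5             # how often running task supervisors report back
    scheduler_heartbeat_sec = 5       # how often the scheduler loops over the DAGs
    max_threads = 2                   # scheduler worker threads; raising this can reduce the gap between tasks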

Airflow: Log file isn't local, Unsupported remote log location

强颜欢笑 submitted on 2019-11-30 12:13:37
I am not able to see the logs attached to the tasks from the Airflow UI. The log-related settings in the airflow.cfg file are:

    remote_base_log_folder =
    base_log_folder = /home/my_projects/ksaprice_project/airflow/logs
    worker_log_server_port = 8793
    child_process_log_directory = /home/my_projects/ksaprice_project/airflow/logs/scheduler

Although I am setting remote_base_log_folder, it is trying to fetch the log from http://:8793/log/tutorial/print_date/2017-08-02T00:00:00 - I don't understand this behavior. According to the settings the workers should store the logs at /home/my_projects/ksaprice_project
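For context: when remote_base_log_folder is left empty and the log file is not on the local machine, the webserver falls back to fetching logs from the worker's log server on worker_log_server_port, which is where the http://:8793/... URL comes from. A minimal remote-logging sketch for Airflow 1.8/1.9 follows; the bucket and connection id are hypothetical placeholders:

    [core]
    # Placeholder bucket and connection id - adjust to your environment.
    remote_base_log_folder = s3://my-airflow-logs/prod
    remote_log_conn_id = my_s3_conn
    encrypt_s3_logs = False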

How to remove default example dags in airflow

前提是你 submitted on 2019-11-30 10:55:24
Question: I am a new user of Airbnb's open-source workflow/data-pipeline software Airflow. There are dozens of default example DAGs after the web UI is started. I have tried many ways to remove these DAGs, but failed. load_examples = False is set in airflow.cfg. The folder lib/python2.7/site-packages/airflow/example_dags has been removed. The states of those example DAGs changed to gray after I removed the dags folder, but the items still occupy the web UI screen. And a new dag folder is specified in
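A common resolution, sketched here rather than taken from the truncated answer above, is that the example DAGs stay registered in the metadata database even after load_examples is disabled, so the config change has to be followed by clearing that metadata. Note that resetdb wipes all run history:

    [core]
    load_examples = False

    # Then, assuming losing existing metadata is acceptable (destructive, prompts for confirmation):
    airflow resetdb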

Is there a way to submit spark job on different server running master

。_饼干妹妹 submitted on 2019-11-30 10:27:49
We have a requirement to schedule Spark jobs; since we are familiar with apache-airflow, we want to go ahead with it to create different workflows. I searched the web but did not find a step-by-step guide to scheduling a Spark job on Airflow, or an option to run it on a different server running the master. An answer to this will be highly appreciated. Thanks in advance. There are three ways you can submit Spark jobs remotely using Apache Airflow: (1) Using SparkSubmitOperator: this operator expects you to have a spark-submit binary and YARN client config set up on the Airflow server. It invokes the spark-submit command
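As a rough sketch of option (1) only (the application path and connection id below are placeholders, and the remaining options are cut off above):

    from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

    submit_job = SparkSubmitOperator(
        task_id='submit_spark_job',
        application='/path/to/my_spark_app.py',  # hypothetical application path
        conn_id='spark_default',                 # Spark connection defined under Admin > Connections
        dag=dag,
    )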

Unable to start Airflow worker/flower and need clarification on Airflow architecture to confirm that the installation is correct

倖福魔咒の submitted on 2019-11-30 06:57:24
Question: Running a worker on a different machine results in the errors specified below. I have followed the configuration instructions and have synced the dags folder. I would also like to confirm that RabbitMQ and PostgreSQL only need to be installed on the Airflow core machine and do not need to be installed on the workers (the workers only connect to the core). The specification of the setup is detailed below: Airflow core/server computer Has the following installed: Python 2.7 with airflow (AIRFLOW
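For reference, a worker machine typically only needs Airflow itself plus an airflow.cfg that points at the broker and metadata database running on the core machine. A minimal sketch, with placeholder hosts and credentials:

    [core]
    executor = CeleryExecutor
    sql_alchemy_conn = postgresql+psycopg2://airflow:secret@core-host/airflow

    [celery]
    broker_url = amqp://airflow:secret@core-host:5672/
    celery_result_backend = db+postgresql://airflow:secret@core-host/airflow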

Airflow dynamic DAG and Task Ids

可紊 submitted on 2019-11-30 06:22:48
I mostly see Airflow being used for ETL/big-data jobs. I'm trying to use it for business workflows where a user action triggers a set of dependent tasks in the future. Some of these tasks may need to be cleared (deleted) based on certain other user actions. I thought the best way to handle this would be via dynamic task ids. I read that Airflow supports dynamic DAG ids, so I created a simple Python script that takes a DAG id and task id as command-line parameters. However, I'm running into problems making it work: it gives a dag_id not found error. Has anyone tried this? Here's the code for
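One likely cause (an assumption, since the code above is truncated) is that the scheduler and webserver only know about DAGs they can discover by parsing files in the dags folder, so a dag_id that exists only when the script is run with particular command-line arguments is never registered. A common pattern is to generate the DAGs inside a file in the dags folder and expose each one as a module-level name:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    def create_dag(dag_id):
        dag = DAG(dag_id=dag_id, start_date=datetime(2017, 1, 1), schedule_interval=None)
        with dag:
            DummyOperator(task_id='start')
        return dag

    # Hypothetical workflow ids; each generated DAG must be reachable as a global in this module.
    for workflow_id in ['workflow_a', 'workflow_b']:
        globals()[workflow_id] = create_dag(workflow_id)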

Access parent dag context at subdag creation time in airflow?

允我心安 submitted on 2019-11-30 06:02:45
Question: I'm trying to access some XCom data from the parent DAG at subdag creation time. I searched the internet but didn't find anything.

    def test(task_id):
        logging.info(f'execution of task {task_id}')

    def load_subdag(parent_dag_id, child_dag_id, args):
        dag_subdag = DAG(
            dag_id='{0}.{1}'.format(parent_dag_id, child_dag_id),
            default_args=args,
            schedule_interval="@daily",
        )
        with dag_subdag:
            r = DummyOperator(task_id='random')
            for i in range(r.xcom_pull(task_ids='take_Ana', key='the
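For what it's worth (the accepted answer is not shown above): xcom_pull needs a runtime task-instance context, which does not exist while the DAG file is being parsed, so pulling XCom at subdag creation time cannot work as written. One hedged workaround is to have the parent task publish the value somewhere readable at parse time, for example an Airflow Variable; the variable name below is hypothetical:

    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.dummy_operator import DummyOperator

    def load_subdag(parent_dag_id, child_dag_id, args):
        dag_subdag = DAG(
            dag_id='{0}.{1}'.format(parent_dag_id, child_dag_id),
            default_args=args,
            schedule_interval="@daily",
        )
        # Read the count from a Variable written by the parent task instead of XCom,
        # because there is no task instance to pull from while the file is parsed.
        task_count = int(Variable.get('ana_task_count', default_var=0))
        with dag_subdag:
            for i in range(task_count):
                DummyOperator(task_id='task_{}'.format(i))
        return dag_subdag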

Is there a way to create/modify connections through Airflow API

给你一囗甜甜゛ submitted on 2019-11-30 05:08:25
Going through Admin -> Connections, we have the ability to create/modify a connection's params, but I'm wondering if I can do the same through an API so I can set up the connections programmatically. airflow.models.Connection seems like it only deals with actually connecting to the instance rather than saving it to the list. It seems like a function that should have been implemented, but I'm not sure where I can find the docs for this specific function. Connection is actually a model which you can use to query and insert a new connection:

    from airflow import settings
    from airflow.models import
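Completing the idea with a sketch (the connection details below are placeholders, not the truncated answer's exact code): the Connection model can be instantiated and persisted through a SQLAlchemy session from airflow.settings:

    from airflow import settings
    from airflow.models import Connection

    conn = Connection(
        conn_id='my_postgres',        # hypothetical connection id
        conn_type='postgres',
        host='db.example.com',
        login='airflow',
        password='secret',
        port=5432,
    )

    session = settings.Session()
    session.add(conn)
    session.commit()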

Airflow: How to SSH and run BashOperator from a different server

允我心安 submitted on 2019-11-30 05:00:40
Is there a way to SSH to a different server and run a BashOperator using Airbnb's Airflow? I am trying to run a Hive SQL command with Airflow, but I need to SSH to a different box in order to run the Hive shell. My tasks should look like this: SSH to server1, start the Hive shell, run the Hive command. Thanks! CMPE: I think that I just figured it out: Create an SSH connection in the UI under Admin > Connections. Note: the connection will be deleted if you reset the database. In the Python file add the following:

    from airflow.contrib.hooks import SSHHook
    sshHook = SSHHook(conn_id=<YOUR CONNECTION ID FROM THE UI>)

Add
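Continuing the idea as a sketch, assuming Airflow 1.8's contrib SSHExecuteOperator is available (the connection id and Hive command below are placeholders):

    from airflow.contrib.hooks import SSHHook
    from airflow.contrib.operators.ssh_execute_operator import SSHExecuteOperator

    sshHook = SSHHook(conn_id='ssh_server1')     # hypothetical connection id created in the UI

    run_hive = SSHExecuteOperator(
        task_id='run_hive_command',
        bash_command='hive -e "SHOW TABLES;"',   # hypothetical Hive command
        ssh_hook=sshHook,
        dag=dag,
    )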

Airflow setup for high availability

℡╲_俬逩灬. submitted on 2019-11-30 03:06:45
Question: How do you deploy the Apache Airflow (formerly known as Airbnb's Airflow) scheduler in high availability? I am not asking about the backend DB or RabbitMQ, which should obviously be deployed in a high-availability configuration. My main focus is the scheduler - is there something special that needs to be done? Answer 1: After a bit of digging I found that it is not safe to run multiple schedulers simultaneously, which means that, out of the box, Airflow schedulers are not safe to use in high-availability environments.