airflow

Introduction to AirFlow

孤者浪人 submitted on 2019-12-19 00:05:46
1. Introduction

Airflow is a programmable platform for scheduling and monitoring workflows. Built around the directed acyclic graph (DAG), Airflow lets you define a set of interdependent tasks and execute them in dependency order. It ships with a rich set of command-line tools for administering the system, and its web UI makes it just as easy to manage and schedule tasks and to monitor their run state in real time, which greatly simplifies operations and maintenance.

2. Executors

Airflow itself is an umbrella platform that works with many components, so there are several deployment options to choose from. The most important choice is the executor, of which there are four:

SequentialExecutor: runs tasks sequentially in a single process; the default executor, normally used only for testing
LocalExecutor: runs tasks in multiple local processes
CeleryExecutor: distributed scheduling, commonly used in production
DaskExecutor: dynamic task scheduling, mainly used for data analysis

The current project uses CeleryExecutor as the executor. Celery is a distributed task framework with no queue of its own, so it needs a third-party broker such as Redis or RabbitMQ; this project uses RabbitMQ. The overall structure of the system is as follows, where:

turing is the external system
the GDags service helps assemble the DAGs
the master node's web UI manages DAGs, logs, and other information
the scheduler handles scheduling and supports only a single node
workers execute the tasks within a DAG; multiple worker nodes are supported

In the overall scheduling system
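To make the idea of "a set of dependent tasks executed in dependency order" concrete, here is a minimal DAG sketch; the task names and the daily schedule are illustrative and not taken from the original post:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A small DAG with three dependent tasks: extract -> transform -> load
dag = DAG(
    dag_id="example_dependencies",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# Declare the dependency chain; Airflow runs the tasks in this order
extract >> transform >> load
```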

Adding logs to Airflow Logs

风流意气都作罢 submitted on 2019-12-18 19:06:57
Question: How can I add my own logs to the Apache Airflow logs that are automatically generated? Any print statements won't get logged there, so I was wondering how I can add my own logs so that they show up in the UI as well?

Answer 1: I think you can work around this by using the logging module and trusting the configuration to Airflow. Something like:

    import ...
    dag = ...

    def print_params_fn(**kwargs):
        import logging
        logging.info(kwargs)
        return None

    print_params = PythonOperator(task_id="print_params",
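For reference, a self-contained version of this pattern; the operator arguments beyond task_id are my additions, not part of the truncated answer:

```python
import logging
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="logging_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

def print_params_fn(**kwargs):
    # Messages emitted through the standard logging module are captured by
    # Airflow's task log handler and appear in the task's log view in the UI.
    logging.info("Task context: %s", kwargs)
    return None

print_params = PythonOperator(
    task_id="print_params",
    python_callable=print_params_fn,
    provide_context=True,  # Airflow 1.x: pass the task context into **kwargs
    dag=dag,
)
```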

Airflow pass parameters to dependent task

杀马特。学长 韩版系。学妹 submitted on 2019-12-18 18:37:37
Question: What is the way to pass parameters into dependent tasks in Airflow? I have a lot of bash files, and I'm trying to migrate this approach to Airflow, but I don't know how to pass some properties between tasks. This is a real example (one common approach using XCom is sketched after the excerpt):

    # sqoop bash template
    sqoop_template = """
        sqoop job --exec {{params.job}} -- --target-dir {{params.dir}} --outdir /src/
    """

    s3_template = """
        s3-dist-cp --src={{params.dir}} --dest={{params.s3}}
    """

    # Task of extraction in EMR
    t1 = BashOperator(
        task_id='extract
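One common way to pass values between dependent tasks is XCom combined with Jinja templating. A hedged sketch follows; the directory-returning task, the bucket paths, and the echo wrapper are illustrative placeholders, not the original poster's code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="pass_params_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

def compute_target_dir(**kwargs):
    # The return value of a PythonOperator callable is pushed to XCom
    # under the key "return_value".
    return "s3://my-bucket/raw/2019-12-18"

compute_dir = PythonOperator(
    task_id="compute_dir",
    python_callable=compute_target_dir,
    provide_context=True,
    dag=dag,
)

# A downstream templated command pulls that value back out of XCom.
copy_to_s3 = BashOperator(
    task_id="copy_to_s3",
    bash_command=(
        "echo s3-dist-cp "
        "--src={{ ti.xcom_pull(task_ids='compute_dir') }} "
        "--dest={{ params.s3 }}"
    ),
    params={"s3": "s3://my-bucket/processed/"},
    dag=dag,
)

compute_dir >> copy_to_s3
```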

AirFlow

我是研究僧i submitted on 2019-12-18 18:09:30
##AirFlow##

#Introduction#

Airflow is a programmable platform for scheduling and monitoring workflows. Built around the directed acyclic graph (DAG), Airflow lets you define a set of interdependent tasks and execute them in dependency order. It provides a rich set of command-line tools for administering the system, and its web UI makes it just as easy to manage and schedule tasks and to monitor their run state in real time, which simplifies operations and maintenance.

#Contents#

https://www.cnblogs.com/cord/p/9450910.html
Airflow installation, deployment, and pitfalls: https://www.jianshu.com/p/9bed1e3ab93b
How to deploy a robust apache-airflow scheduling system: https://www.jianshu.com/p/2ecef979c606

Source: CSDN  Author: weixin_45965884  Link: https://blog.csdn.net/weixin_45965884/article/details/103584965

Airflow - creating dynamic Tasks from XCOM

时光总嘲笑我的痴心妄想 submitted on 2019-12-18 17:11:29
Question: I'm attempting to generate a set of dynamic tasks from an XCom variable. In the XCom I'm storing a list, and I want to use each element of the list to dynamically create a downstream task. My use case is that I have an upstream operator that checks an SFTP server for files and returns a list of file names matching specific criteria. I want to create dynamic downstream tasks for each of the file names returned. I've simplified it to the below, and while it works I feel like it's not an idiomatic
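The excerpt cuts off before the questioner's code. One common simplification (my own sketch, not the original poster's approach) is to pull the XCom list in a single downstream task and fan out the work there, since Airflow 1.x builds the task graph when the DAG file is parsed, before any XCom value exists:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="xcom_fanout_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

def list_files(**kwargs):
    # Stand-in for the SFTP check; the returned list is pushed to XCom.
    return ["report_a.csv", "report_b.csv"]

def process_files(**kwargs):
    ti = kwargs["ti"]
    files = ti.xcom_pull(task_ids="list_files") or []
    for name in files:
        # Handle each discovered file inside one task instead of one task per file.
        print("processing", name)

list_task = PythonOperator(
    task_id="list_files", python_callable=list_files, provide_context=True, dag=dag
)
process_task = PythonOperator(
    task_id="process_files", python_callable=process_files, provide_context=True, dag=dag
)

list_task >> process_task
```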

Airflow: Log file isn't local, Unsupported remote log location

删除回忆录丶 submitted on 2019-12-18 14:51:55
Question: I am not able to see the logs attached to the tasks from the Airflow UI. The log-related settings in the airflow.cfg file are:

    remote_base_log_folder =
    base_log_folder = /home/my_projects/ksaprice_project/airflow/logs
    worker_log_server_port = 8793
    child_process_log_directory = /home/my_projects/ksaprice_project/airflow/logs/scheduler

Although I am setting remote_base_log_folder, it is trying to fetch the log from http://:8793/log/tutorial/print_date/2017-08-02T00:00:00 - I don't understand this behavior.

Airflow dynamic tasks at runtime

梦想的初衷 submitted on 2019-12-18 14:48:15
Question: Other questions about 'dynamic tasks' seem to address dynamic construction of a DAG at schedule or design time. I'm interested in dynamically adding tasks to a DAG during execution.

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import PythonOperator
    from datetime import datetime

    dag = DAG('test_dag',
              description='a test',
              schedule_interval='0 0 * * *',
              start_date=datetime(2018, 1, 1),
              catchup=False)

    def make_tasks():
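Since the DAG is rebuilt from the Python file each time the scheduler parses it, the usual workaround (a sketch of the general parse-time pattern, not the accepted answer to this question) is to drive task creation from something resolvable at parse time, such as an Airflow Variable; the variable name and task names below are illustrative:

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="parse_time_dynamic_tasks",
    schedule_interval="0 0 * * *",
    start_date=datetime(2018, 1, 1),
    catchup=False,
)

# Task names are read at parse time; editing the Variable changes the DAG shape
# on the next parse, but tasks cannot be added while a run is already executing.
task_names = json.loads(Variable.get("dynamic_task_names", default_var='["a", "b"]'))

def do_work(name, **kwargs):
    print("working on", name)

for name in task_names:
    PythonOperator(
        task_id="work_%s" % name,
        python_callable=do_work,
        op_kwargs={"name": name},
        provide_context=True,
        dag=dag,
    )
```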

Is there a way to submit a Spark job on a different server running the master

戏子无情 submitted on 2019-12-18 13:36:54
Question: We have a requirement to schedule Spark jobs. Since we are familiar with apache-airflow, we want to go ahead with it to create different workflows. I searched the web but did not find a step-by-step guide to scheduling Spark jobs on Airflow, or an option to run them on a different server running the master. An answer to this will be highly appreciated. Thanks in advance.

Answer 1: There are 3 ways you can submit Spark jobs using Apache Airflow remotely:

(1) Using SparkSubmitOperator: This operator expects you have a
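The excerpt stops mid-answer. For orientation, here is a hedged SparkSubmitOperator sketch; the connection id, application path, and master location are placeholders I chose, and the sketch assumes spark-submit and the relevant client configuration are available on the machine running the Airflow worker:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

# conn_id refers to an Airflow connection of type "Spark" whose host points at
# the remote master, e.g. spark://spark-master.example.com:7077 or yarn.
submit_job = SparkSubmitOperator(
    task_id="submit_job",
    application="/opt/jobs/etl_job.py",  # placeholder path to the Spark application
    conn_id="spark_default",
    conf={"spark.executor.memory": "2g"},
    dag=dag,
)
```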

Writing to Airflow Logs

懵懂的女人 submitted on 2019-12-18 12:48:06
Question: One way to write to the logs in Airflow is to return a string from a PythonOperator, like on line 44 here. Are there other ways that allow me to write to the Airflow log files? I've found that print statements are not saved to the logs.

Answer 1: You can import the logging module into your code and write to the logs that way:

    import logging
    logging.info('Hello')

Here are some more options:

    import logging
    logging.debug('This is a debug message')
    logging.info('This is an info message')
    logging.warning(
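A related pattern worth noting (my addition, not part of the truncated answer): custom operators inherit a ready-made logger from Airflow's LoggingMixin, so inside an operator you can write to the task log through self.log. The operator name below is a toy example:

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class GreetOperator(BaseOperator):
    """Toy operator showing the self.log handle provided by LoggingMixin."""

    @apply_defaults
    def __init__(self, name, *args, **kwargs):
        super(GreetOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        # These messages end up in the task's log file and in the UI log view.
        self.log.info("Hello, %s", self.name)
        self.log.warning("This is a warning from %s", self.task_id)
```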

Airflow dynamic DAG and Task Ids

百般思念 submitted on 2019-12-18 12:17:45
Question: I mostly see Airflow being used for ETL/Big data related jobs. I'm trying to use it for business workflows wherein a user action triggers a set of dependent tasks in the future. Some of these tasks may need to be cleared (deleted) based on certain other user actions. I thought the best way to handle this would be via dynamic task ids. I read that Airflow supports dynamic DAG ids. So, I created a simple Python script that takes DAG id and task id as command line parameters. However, I'm running
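For context on the "dynamic DAG ids" the questioner mentions, the usual parse-time pattern (a hedged sketch; the id scheme and customer list are illustrative) is to register generated DAG objects in the module's globals() so the scheduler can discover them:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

def build_dag(dag_id):
    # Each call builds a distinct DAG with its own id and a single placeholder task.
    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
    )
    DummyOperator(task_id="start", dag=dag)
    return dag

# The scheduler picks up any DAG object reachable from module-level globals,
# so dynamically named DAGs are registered by assigning into globals().
for customer in ["acme", "globex"]:
    dag_id = "workflow_%s" % customer
    globals()[dag_id] = build_dag(dag_id)
```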