airflow

Airflow SparkSubmitOperator - How to spark-submit in another server

Submitted by 自闭症网瘾萝莉.ら on 2019-12-31 22:39:07
Question: I am new to Airflow and Spark and I am struggling with the SparkSubmitOperator. Our airflow scheduler and our hadoop cluster are not set up on the same machine (first question: is that good practice?). We have many automated procedures that need to call pyspark scripts. Those pyspark scripts are stored on the hadoop cluster (10.70.1.35). The airflow DAGs are stored on the airflow machine (10.70.1.22). Currently, when we want to spark-submit a pyspark script with airflow, we use a simple
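A minimal sketch of one common setup for this situation (not from the original post): install a Spark client on the Airflow host and point an Airflow Spark connection at the remote cluster, so SparkSubmitOperator submits from the Airflow machine to the cluster. The connection id spark_remote and the application path are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

# Hypothetical DAG; "spark_remote" is a connection created in the Airflow UI
# whose host points at the remote cluster's master (e.g. YARN on 10.70.1.35).
dag = DAG('spark_submit_example', start_date=datetime(2019, 1, 1), schedule_interval=None)

submit_job = SparkSubmitOperator(
    task_id='submit_pyspark_script',
    conn_id='spark_remote',                        # hypothetical connection id
    application='/path/to/shared/pyspark_job.py',  # hypothetical path visible to spark-submit
    dag=dag,
)
```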

Airflow: pass {{ ds }} as param to PostgresOperator

Submitted by 青春壹個敷衍的年華 on 2019-12-31 20:33:08
Question: I would like to use the execution date as a parameter to my SQL file. I tried dt = '{{ ds }}' s3_to_redshift = PostgresOperator( task_id='s3_to_redshift', postgres_conn_id='redshift', sql='s3_to_redshift.sql', params={'file': dt}, dag=dag ) but it doesn't work. Answer 1: dt = '{{ ds }}' doesn't work because Jinja (the templating engine used within airflow) does not process the entire DAG definition file. For each operator there are fields which Jinja will process, which are part of the definition of the
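A minimal sketch of the usual workaround, consistent with the answer above: since sql is a templated field of PostgresOperator while params is passed through un-rendered, reference {{ ds }} inside the .sql file itself rather than trying to render it in the Python code. The DAG id is made up for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG('s3_to_redshift_example', start_date=datetime(2019, 1, 1), schedule_interval='@daily')

# s3_to_redshift.sql lives in the DAG's template search path and can use
# Jinja directly, e.g.:  ... WHERE load_date = '{{ ds }}'
s3_to_redshift = PostgresOperator(
    task_id='s3_to_redshift',
    postgres_conn_id='redshift',
    sql='s3_to_redshift.sql',
    dag=dag,
)
```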

airflow pass parameter from cli

Submitted by 两盒软妹~` on 2019-12-31 12:43:07
Question: Is there a way to pass a parameter to: airflow trigger_dag dag_name {param}? I have a script that monitors a directory for files; when a file gets moved into the target directory, I want to trigger the DAG and pass the file path as a parameter. Answer 1: You can pass it like this: airflow trigger_dag --conf '{"file_variable": "/path/to/file"}' dag_id Then in your DAG you can access this value using templating as follows: {{ dag_run.conf.file_variable }} If this doesn't work, sharing a simple
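A minimal sketch (hypothetical DAG and task names, same file_variable key as in the answer) of two ways a task can read the conf passed by trigger_dag: through Jinja in a templated field, or through the context handed to a PythonOperator callable.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG('file_triggered_dag', start_date=datetime(2019, 1, 1), schedule_interval=None)

# Jinja route: dag_run.conf is available in templated fields.
echo_path = BashOperator(
    task_id='echo_path',
    bash_command='echo "processing {{ dag_run.conf.file_variable }}"',
    dag=dag,
)

# Python route: read the conf dict from the task context.
def handle_file(**context):
    file_path = context['dag_run'].conf.get('file_variable')
    print('processing %s' % file_path)

process_file = PythonOperator(
    task_id='process_file',
    python_callable=handle_file,
    provide_context=True,  # needed on Airflow 1.x to receive **context
    dag=dag,
)
```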

duplicate key value violates unique constraint when adding path variable in airflow dag

Submitted by 别来无恙 on 2019-12-31 06:04:49
Question: To set up the connections and variables in airflow I use a DAG; we do this so that we can set airflow up again quickly if we ever have to rebuild everything. It does work: my connections and variables show up, but the task "fails". The error says there is already an sql_path variable: [2018-03-30 19:42:48,784] {{models.py:1595}} ERROR - (psycopg2.IntegrityError) duplicate key value violates unique constraint "variable_key_key" DETAIL: Key (key)=(sql_path) already exists. [SQL: 'INSERT INTO
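A minimal sketch of one way to make such a setup task idempotent (not from the original question): go through Variable.set(), which replaces any existing row for the key instead of issuing a bare INSERT, so re-running the setup DAG does not hit the unique constraint on variable.key. The DAG id, callable and value are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

dag = DAG('setup_airflow', start_date=datetime(2019, 1, 1), schedule_interval=None)

def setup_variables():
    # Variable.set() removes any existing row with the same key before
    # inserting, so this is safe to run repeatedly.
    Variable.set('sql_path', '/opt/airflow/sql')  # hypothetical value

set_variables = PythonOperator(
    task_id='set_variables',
    python_callable=setup_variables,
    dag=dag,
)
```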

Dynamic dags not getting added by scheduler

Submitted by ℡╲_俬逩灬. on 2019-12-31 05:38:07
Question: I am trying to create dynamic DAGs and get them picked up by the scheduler. I followed the reference at https://www.astronomer.io/guides/dynamically-generating-dags/ which works well. I changed it a bit, as in the code below, and need help debugging the issue. What I tried: 1. Test-run the file. The DAG gets executed and globals() prints all the DAG objects, but somehow they are not listed by list_dags or in the UI. from datetime import datetime, timedelta import requests import json from airflow
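For reference, a minimal sketch of the globals() pattern the astronomer.io guide describes (the DAG ids and task are made up): the generated DAG objects must be bound to module-level names in a file inside the dags folder, otherwise the scheduler's DagBag will not discover them.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

def create_dag(dag_id):
    dag = DAG(dag_id, start_date=datetime(2019, 1, 1), schedule_interval='@daily')
    DummyOperator(task_id='start', dag=dag)
    return dag

# Bind each generated DAG to a module-level name so the scheduler can find it.
for i in range(3):  # hypothetical: three generated DAGs
    dag_id = 'dynamic_dag_%d' % i
    globals()[dag_id] = create_dag(dag_id)
```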

Adding extra celery configs to Airflow

Submitted by 那年仲夏 on 2019-12-30 06:57:11
Question: Does anyone know where I can add extra Celery configs for the airflow Celery executor? For instance, I want the property described at http://docs.celeryproject.org/en/latest/userguide/configuration.html#worker-pool-restarts, but how do I add extra Celery properties? Answer 1: Use the just-released Airflow 1.9.0; this is now configurable. In airflow.cfg there is this line: # Import path for celery configuration options celery_config_options = airflow.config_templates.default_celery.DEFAULT_CELERY_CONFIG which points
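A minimal sketch of how such an override module could look (the module name my_celery_config is hypothetical; it must be importable on the workers' PYTHONPATH, and airflow.cfg would then set celery_config_options = my_celery_config.CELERY_CONFIG):

```python
# my_celery_config.py (hypothetical module name)
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

# Start from Airflow's defaults and layer extra Celery settings on top,
# e.g. the worker_pool_restarts option from the linked Celery docs.
CELERY_CONFIG = dict(DEFAULT_CELERY_CONFIG)
CELERY_CONFIG['worker_pool_restarts'] = True
```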

A quick look at airflow's metadata table structure

Submitted by 只谈情不闲聊 on 2019-12-29 23:37:07
A short rundown of what the tables in airflow's metadata database are for:
alembic_version #
celery_taskmeta #
celery_tasksetmeta #
chart #
connection #
dag # stores the DAG names
dag_pickle #
dag_run #
dag_stats # information needed by the airflow web UI
import_error #
job #
known_event #
known_event_type #
kombu_message #
kombu_queue #
log # all DAG logs
sla_miss #
slot_pool #
task_fail # records information about failed tasks……
task_instance # records the start time, end time and duration of successful task executions
users # airflow authentication users table
variable #
xcom #

Deleting a retired DAG ## first delete the .py script file, this is important
set @dag_id = 'BAD_DAG';
delete from airflow.xcom where dag_id = @dag_id;
delete from airflow.task_instance where dag_id = @dag_id;
delete from airflow.sla_miss where dag_id =
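A minimal Python sketch of the same cleanup, under the assumption that you would rather go through Airflow's own SQLAlchemy models than raw SQL. The DAG id 'BAD_DAG' is the example from the post; the set of models cleaned here mirrors some of the tables listed above but is not exhaustive.

```python
from airflow import settings
from airflow.models import DagRun, SlaMiss, TaskInstance, XCom

dag_id = 'BAD_DAG'  # the retired DAG from the example above

session = settings.Session()
for model in (XCom, TaskInstance, SlaMiss, DagRun):
    # Delete all metadata rows belonging to the retired DAG.
    session.query(model).filter(model.dag_id == dag_id).delete(synchronize_session=False)
session.commit()
session.close()
```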

airflow 1.8.0 deployment reference

Submitted by 筅森魡賤 on 2019-12-29 23:36:55
See the airflow official site and airflow-github. Airflow 1.10 has UTC time-zone issues, so we stick with airflow 1.8.0. The company is still on CentOS 6.5, where the default Python is 2.6, while running airflow with the Celery executor requires Python 2.7, so first prepare a new Python environment.
## http://hao.jobbole.com/pythonbrew/
## export PYTHONBREW_ROOT=/usr/local/.pythonbrew
yum install python-pip # installs into the home directory by default
export PYTHONBREW_ROOT=/usr/local/.pythonbrew
pip install pythonbrew
## all kinds of pitfalls
## download the Python source tarball first
wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
#......
yum -y install python-devel mysql-devel
## official guide: install the dependency packages https://github.com/utahta/pythonbrew
yum -y install zlib-devel openssl-devel readline-devel
pythonbrew install Python-2

Introduction to AirFlow

Submitted by 醉酒当歌 on 2019-12-29 23:12:36
Introduction to AirFlow
1. What is AirFlow?
airflow is a platform for orchestrating, scheduling and monitoring workflows. It was open-sourced by Airbnb and is now incubating at the Apache Software Foundation. airflow organizes a workflow as a DAG (directed acyclic graph) of tasks, and the scheduler executes those tasks on a set of workers according to the specified dependencies. airflow also provides a rich set of command-line tools and an easy-to-use web UI for inspecting and operating on workflows, along with monitoring and alerting.
Airflow's scheduling is based on cron-style expressions, but compared with crontab, airflow lets you see task execution status and the logical dependencies between tasks at a glance, send email alerts when a task fails, and inspect task execution logs. Managing jobs with crontab has the following drawbacks:
1. When many tasks are scheduled and run together, it is hard to sort out the dependencies between them;
2. It is not easy to see which task is currently running;
3. When a task fails, it is inconvenient to view the execution log, i.e. hard to locate the failing task and the cause of the error;
4. It is inconvenient to see the start, end and elapsed time of each task in a scheduled flow, which is very important for optimizing task jobs;
5. It is inconvenient to keep a history of past scheduled runs, which matters for job optimization and troubleshooting.
1. Pros and cons: without airflow, you have to write your own scheduling code, debugging is complex, functionality is limited, and there is no overall scheduling capability; with airflow, the framework handles scheduling
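To make the "DAG of tasks" idea above concrete, here is a minimal sketch of a DAG with two tasks and one dependency (the DAG id, task ids and commands are made up for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('hello_airflow', start_date=datetime(2019, 1, 1), schedule_interval='@daily')

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

extract >> load  # "load" runs only after "extract" succeeds
```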

AirFlow installation and configuration

Submitted by 六眼飞鱼酱① on 2019-12-29 23:12:20
airflow installation and configuration
Installing the software airflow depends on
Installing Python 3.6.5
Install the build dependencies:
[root@node01 ~]# yum -y install zlib zlib-devel bzip2 bzip2-devel ncurses ncurses-devel readline readline-devel openssl openssl-devel openssl-static xz lzma xz-devel sqlite sqlite-devel gdbm gdbm-devel tk tk-devel gcc
Download Python:
You can browse https://www.python.org/ftp/python/ to see the available Python versions; here we install Python-3.6.5.tgz. Download the Python source tarball with the following command:
[root@node01 ~]# wget https://www.python.org/ftp/python/3.6.5/Python-3.6.5.tgz
Extract the Python source tarball:
[root@node01 ~]# tar -zxvf Python-3.6.5.tgz
[root@node01 ~]# cd Python-3.6.5
Install Python:
[root@node01 Python-3.6.5]# .