Airflow

Methods for Managing Airflow

可紊 submitted on 2019-12-29 23:12:07
Process-management tool Supervisord

1. Install the process-management tool Supervisord to manage the airflow processes (note: installing with easy_install does not work under Python 3 and causes many problems):

```
easy_install supervisor
echo_supervisord_conf > /etc/supervisord.conf
```

2. Edit the file supervisord.conf and add the startup commands:

```
vi /etc/supervisord.conf

[program:airflow_web]
command=/usr/bin/airflow webserver -p 8080

[program:airflow_worker]
command=/usr/bin/airflow worker

[program:airflow_scheduler]
command=/usr/bin/airflow scheduler
```

3. Start the supervisord service:

```
/usr/bin/supervisord -c /etc/supervisord.conf
```

4. You can now manage the airflow services with supervisorctl:

```
supervisorctl start airflow_web
supervisorctl stop airflow_web
supervisorctl restart airflow_web
```

Using the Airflow Web UI

孤街浪徒 submitted on 2019-12-29 23:11:48
The Airflow webserver UI:

DAGs: the On/Off toggle on the left controls a DAG's run state; Off means paused, On means running. Note: every DAG script is in the Off state when it is first deployed. If a DAG's name is not clickable, the DAG may have been deleted or not yet loaded; if it has not been loaded, click the refresh button on the right. Note: since several webservers may be deployed, a single refresh may not clear every webserver's cache, so try refreshing several times.

Recent Tasks shows the state of the Task Instances (the execution records of individual tasks) in the most recent DAG Run (the execution record of the DAG). If the DAG Run's state is running, it shows the Task Instances of both the most recently completed DAG Run and the one currently running.

Last Run shows the most recent execution date. Note: the execution date is not the actual run time; the details are covered below under DAG configuration. Hovering over the info marker to the right of the execution date shows the start date, which is the actual run time; the start date is generally the next scheduled time after the corresponding execution date.

Task action dialog: in the DAG's tree view and the DAG…
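To make the execution date semantics concrete, here is a minimal sketch (the DAG name and schedule are illustrative, not from the article): for a daily DAG, the run whose execution date is 2019-12-01 is only triggered at the end of that interval, so its start date falls on or after 2019-12-02.

```python
# Minimal sketch (hypothetical DAG) showing that execution_date marks the
# START of a schedule interval, while the run is triggered at its END.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id='execution_date_demo',          # illustrative name
    start_date=datetime(2019, 12, 1),
    schedule_interval=timedelta(days=1),
)

def show_dates(execution_date, **context):
    # For the interval 2019-12-01 -> 2019-12-02 this prints 2019-12-01,
    # even though the task actually starts on or after 2019-12-02.
    print('execution_date (interval start):', execution_date)

show = PythonOperator(
    task_id='show_dates',
    python_callable=show_dates,
    provide_context=True,
    dag=dag,
)
```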

Common Airflow Problems

我的未来我决定 submitted on 2019-12-29 23:11:40
Installation problems

1. Installation fails with an ERROR from "python setup.py xxx".

Problem: first, you need to update pip with `pip install --upgrade pip`. Second, your setuptools is too old, which produces `Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-G9yO9Z/tldr/`; setuptools also needs updating (e.g. `pip install --upgrade setuptools`):

```
File "/tmp/pip-build-G9yO9Z/tldr/setuptools_scm-3.3.3-py2.7.egg/setuptools_scm/integration.py", line 9, in version_keyword
File "/tmp/pip-build-G9yO9Z/tldr/setuptools_scm-3.3.3-py2.7.egg/setuptools_scm/version.py", line 66, in _warn_if_setuptools_outdated
setuptools_scm.version.SetuptoolsOutdatedWarning: your setuptools is too old (<12)
```

EMR cluster creation from an Airflow DAG run; once the task is done, the EMR cluster is terminated

左心房为你撑大大i submitted on 2019-12-29 09:53:26
Question: I have Airflow jobs which run fine on the EMR cluster. What I need is this: say I have 4 Airflow jobs which each require an EMR cluster for roughly 20 minutes to complete their task. Why can't we create an EMR cluster at DAG run time, and terminate the created EMR cluster once the job finishes?

Answer 1: Absolutely, that would be the most efficient use of resources. Let me warn you: there are a lot of details in this; I'll try to list as many as would get you going. I…
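A minimal sketch of the create-run-terminate pattern the answer describes, using the EMR operators that shipped in Airflow's contrib package at the time; the job-flow overrides and the step definition are placeholder assumptions, not from the thread.

```python
# Hypothetical sketch: spin up an EMR cluster, run one step, then
# terminate the cluster, with Airflow 1.10's contrib EMR operators.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator

dag = DAG('emr_transient_cluster', start_date=datetime(2019, 1, 1), schedule_interval=None)

create_cluster = EmrCreateJobFlowOperator(
    task_id='create_cluster',
    aws_conn_id='aws_default',
    emr_conn_id='emr_default',
    job_flow_overrides={'Name': 'transient-cluster'},   # placeholder overrides
    dag=dag,
)

add_step = EmrAddStepsOperator(
    task_id='add_step',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=[{  # placeholder step definition
        'Name': 'my_job',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {'Jar': 'command-runner.jar',
                          'Args': ['spark-submit', 's3://bucket/job.py']},
    }],
    dag=dag,
)

watch_step = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag,
)

terminate_cluster = EmrTerminateJobFlowOperator(
    task_id='terminate_cluster',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
    aws_conn_id='aws_default',
    trigger_rule='all_done',   # terminate even if the step failed
    dag=dag,
)

create_cluster >> add_step >> watch_step >> terminate_cluster
```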

Airflow HiveCliHook connection to remote hive cluster?

点点圈 submitted on 2019-12-29 09:10:35
Question: I am trying to connect to my Hive server from a local copy of Airflow, but it seems like the HiveCliHook is trying to connect to my local copy of Hive. I'm running the following to test it:

```python
import airflow
from airflow.models import Connection
from airflow.hooks.hive_hooks import HiveCliHook

usr = 'myusername'
pss = 'mypass'
session = airflow.settings.Session()
hive_cli = session.query(Connection).filter(Connection.conn_id == 'hive_cli_default').all()[0]
hive_cli.host = 'hive_server.test…
```
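A hedged sketch of how the test might continue (the remote host, port, and query are placeholders, and the commit-then-run flow is an assumption about the poster's intent, not their actual code). One detail worth noting: HiveCliHook shells out to a local hive/beeline client, and it only reaches the remote host named in the connection when beeline is enabled through the connection's extra field.

```python
# Hypothetical continuation: point hive_cli_default at the remote Hive
# server, persist the change, then run a statement through HiveCliHook.
hive_cli.host = 'hive_server.example.com'      # placeholder remote host
hive_cli.port = 10000                          # placeholder HiveServer2 port
hive_cli.login = usr
hive_cli.password = pss
hive_cli.extra = '{"use_beeline": true}'       # route through beeline/JDBC
session.commit()                               # persist so the hook sees it

hook = HiveCliHook(hive_cli_conn_id='hive_cli_default')
print(hook.run_cli('SHOW DATABASES;'))
```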

AirflowException: Celery command failed - The recorded hostname does not match this instance's hostname

佐手、 submitted on 2019-12-29 08:21:50
Question: I'm running Airflow in a clustered environment on two AWS EC2 instances, one for the master and one for the worker. The worker node periodically throws this error when running `$ airflow worker`:

```
[2018-08-09 16:15:43,553] {jobs.py:2574} WARNING - The recorded hostname ip-1.2.3.4 does not match this instance's hostname ip-1.2.3.4.eco.tanonprod.comanyname.io
Traceback (most recent call last):
  File "/usr/bin/airflow", line 27, in <module>
    args.func(args)
  File "/usr/local/lib/python3.6…
```
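The warning stems from Airflow resolving the hostname one way when the task was recorded and another way when it is checked. Below is a small illustration of how the two common resolvers disagree on EC2; the usual remedy, pinning `hostname_callable` in airflow.cfg to the same callable on every node, is a general fix, not necessarily this poster's.

```python
# Illustration: the short hostname and the fully qualified domain name can
# differ on EC2, which is exactly the mismatch the warning reports.
import socket

print(socket.gethostname())  # e.g. 'ip-1-2-3-4' (short form)
print(socket.getfqdn())      # e.g. 'ip-1-2-3-4.eco.tanonprod.comanyname.io'

# Airflow picks its hostname via the callable named by `hostname_callable`
# in airflow.cfg (e.g. 'socket:getfqdn'); keeping that setting identical on
# master and worker keeps the recorded and observed hostnames consistent.
```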

Python script scheduling in airflow

给你一囗甜甜゛ submitted on 2019-12-29 06:45:16
Question: Hi everyone, I need to schedule my Python files (which contain data extraction from SQL and some joins) using Airflow. I have successfully installed Airflow on my Linux server, and the Airflow webserver is available to me. But even after going through the documentation, I am not clear where exactly I need to write the script for scheduling, and how that script will become available to the Airflow webserver so I can see its status. As far as the configuration is concerned, I know where the dag folder is…
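A minimal sketch of what such a scheduling script looks like: a file dropped into the configured dags folder, wrapping a hypothetical extraction function in a PythonOperator so the scheduler picks it up and the webserver shows its status. The file name, schedule, and function body are illustrative placeholders.

```python
# Save as e.g. $AIRFLOW_HOME/dags/extract_dag.py; the scheduler scans this
# folder and the DAG then appears in the webserver UI.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    # placeholder for the SQL extraction and join logic
    print('extracting data...')

dag = DAG(
    dag_id='daily_extraction',
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
)

extract_task = PythonOperator(
    task_id='extract',
    python_callable=extract,
    dag=dag,
)
```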

How to create path using execution date in Airflow?

情到浓时终转凉″ submitted on 2019-12-25 02:34:06
Question: I have the following Airflow DAG:

```python
start_task = DummyOperator(task_id='start_task', dag=dag)

gcs_export_uri_template = 'adstest/2018/08/31/*'

update_bigquery = GoogleCloudStorageToBigQueryOperator(
    dag=dag,
    task_id='load_ads_to_BigQuery',
    bucket=GCS_BUCKET_ID,
    destination_project_dataset_table=table_name_template,
    source_format='CSV',
    source_objects=[gcs_export_uri_template],
    schema_fields=dc(),
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    skip_leading_rows=1,…
```
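A hedged sketch of the usual answer to the title question: replace the hard-coded date in the GCS prefix with Airflow's templating macros, since `source_objects` is a templated field on this operator. The variable name reuses the poster's; the exact macro choice is an assumption.

```python
# Build the GCS path from the run's execution date instead of hard-coding it.
# source_objects (and destination_project_dataset_table) are template_fields
# on GoogleCloudStorageToBigQueryOperator, so the Jinja is rendered per run.
gcs_export_uri_template = 'adstest/{{ execution_date.strftime("%Y/%m/%d") }}/*'

# Equivalent form using the built-in date macros:
# gcs_export_uri_template = 'adstest/{{ macros.ds_format(ds, "%Y-%m-%d", "%Y/%m/%d") }}/*'
```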

How to prevent “Execution failed:[Errno 32] Broken pipe” in Airflow

霸气de小男生 submitted on 2019-12-25 01:49:55
Question: I just started using Airflow to coordinate our ETL pipeline. I encountered a broken-pipe error when I ran a DAG. I've seen a general Stack Overflow discussion here; my case is more on the Airflow side. According to the discussion in that post, the possible root cause is: the broken-pipe error usually occurs if your request is blocked or takes too long, and after the request side times out, it closes the connection; then, when the responding side (the server) tries to write to the socket, it throws a…
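The quoted mechanism can be reproduced in isolation. Here is a self-contained sketch (plain sockets, nothing Airflow-specific): one side closes the connection early, and a later write from the other side raises [Errno 32] Broken pipe.

```python
# Demonstrates the root cause described above: writing to a socket whose
# peer has already closed the connection raises BrokenPipeError (EPIPE).
import socket
import threading
import time

def close_immediately(srv):
    conn, _ = srv.accept()
    conn.close()          # the peer closes right away (the "timed out" side)

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(('127.0.0.1', 0))
srv.listen(1)
threading.Thread(target=close_immediately, args=(srv,), daemon=True).start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
time.sleep(0.2)           # give the peer time to close

try:
    for _ in range(10):   # the first write may be buffered; a later one
        cli.sendall(b'x' * 65536)   # hits the closed socket -> [Errno 32]
        time.sleep(0.05)
except (BrokenPipeError, ConnectionResetError) as exc:
    print('write after close raised:', exc)
finally:
    cli.close()
    srv.close()
```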

Why does changing start_date in Airflow require renaming the DAG?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-25 01:13:36
Question: I am a data engineer and work with Airflow regularly. When redeploying DAGs with a new start date, the best practice is as shown here:

> Don't change start_date + interval: when a DAG has been run, the scheduler database contains instances of the runs of that DAG. If you change the start_date or the interval and redeploy it, the scheduler may get confused because the intervals are different or the start_date is way back. The best way to deal with this is to change the version of the DAG…
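A minimal sketch of the versioning convention the quote recommends: bump the dag_id when the schedule semantics change, so the old run records stay attached to the old id. The names and dates are illustrative, not from the question.

```python
# Redeploying with a new start_date: publish under a new, versioned dag_id
# instead of mutating the existing DAG's schedule in place.
from datetime import datetime

from airflow import DAG

# v1 ran with the old schedule; its run history stays under this id.
# dag = DAG('my_pipeline_v1', start_date=datetime(2018, 1, 1), schedule_interval='@daily')

# v2 carries the new start_date; the scheduler treats it as a fresh DAG.
dag = DAG(
    dag_id='my_pipeline_v2',
    start_date=datetime(2019, 6, 1),
    schedule_interval='@daily',
)
```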