airflow

Airflow scheduling framework: single-node installation notes

Submitted by 送分小仙女 on 2020-01-15 06:49:58
crontab jobs are not convenient to monitor day to day, so I decided to switch to a new scheduling framework.

1. Install dependencies

    # avoid storing connection passwords in plain text
    pip3 install cryptography
    pip3 install paramiko
    # fixes: AttributeError: module 'enum' has no attribute 'IntFlag'
    pip3 uninstall enum34
    pip3 install celery
    pip3 install redis
    pip3 install dask
    yum install mysql-devel
    pip3 install mysqlclient
    pip3 install apache-airflow
    # avoid generating a large volume of logs
    cd /usr/local/lib/python3.7/site-packages/airflow
    vim settings.py
    # LOGGING_LEVEL = logging.INFO
    LOGGING_LEVEL = logging.WARN

2. Configure environment variables

    # vim /etc/profile
    # set the airflow working directory; by default it lives under the current user's home directory
    export AIRFLOW_HOME=/usr/local/airflow
    # source
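The excerpt is cut off here. As a hedged continuation (not from the original post; Airflow 1.10-era CLI assumed, port number illustrative), the usual next steps on a single node look roughly like this:

    source /etc/profile            # pick up AIRFLOW_HOME
    airflow initdb                 # create and initialize the metadata database
    airflow webserver -p 8080      # start the web UI
    airflow scheduler              # start the scheduler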

Export all airflow connections to new environment

Submitted by 邮差的信 on 2020-01-15 06:22:05
Question: I'm trying to migrate all of the existing Airflow connections to a new Airflow installation. Looking at the CLI options (airflow connections --help), there is an option to list connections but none to export/import them to/from JSON. Is there a way, via the CLI or the Airflow UI, to migrate connections across multiple Airflow deployments? Answer 1: You can connect directly to the Airflow meta DB, dump those connections, and then load them into the separate database. However, if you want to automate something like this,
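The quoted answer stops mid-sentence. As a hedged illustration of the "dump from the meta DB" approach it describes (Airflow 1.10.x assumed; the output file name is illustrative, not from the post):

    # Sketch: export connections from the Airflow metadata DB to JSON
    import json
    from airflow.models import Connection
    from airflow.settings import Session

    session = Session()
    conns = []
    for c in session.query(Connection):
        conns.append({
            "conn_id": c.conn_id,
            "conn_type": c.conn_type,
            "host": c.host,
            "schema": c.schema,
            "login": c.login,
            "password": c.password,   # decrypted via the fernet key, so handle the output carefully
            "port": c.port,
            "extra": c.extra,
        })
    session.close()
    with open("connections.json", "w") as f:
        json.dump(conns, f, indent=2)

On the target environment each entry can then be re-created with the CLI, e.g. airflow connections -a --conn_id my_conn --conn_uri 'postgres://user:pass@host:5432/db' (conn_id and URI here are illustrative).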

Use airflow hive operator and output to a text file

Submitted by 有些话、适合烂在心里 on 2020-01-14 22:33:37
Question: Hi, I want to execute a Hive query using the Airflow HiveOperator and write the result to a file. I don't want to use INSERT OVERWRITE here.

    hive_ex = HiveOperator(
        task_id='hive-ex',
        hql='/sql/hive-ex.sql',
        hiveconfs={
            'DAY': '{{ ds }}',
            'YESTERDAY': '{{ yesterday_ds }}',
            'OUTPUT': '{{ file_path }}' + 'csv',
        },
        dag=dag,
    )

What is the best way to do this? I know how to do it with the BashOperator, but want to know if we can use the HiveOperator: hive_ex = BashOperator( task_id='hive-ex', bash_command=
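The excerpt is truncated at the BashOperator example. A hedged sketch of what that bash-based fallback typically looks like (output path and script name are illustrative, not from the post; assumes the dag object from the poster's DAG file):

    # Sketch: run the Hive query from a BashOperator and redirect stdout to a file
    from airflow.operators.bash_operator import BashOperator

    hive_ex = BashOperator(
        task_id='hive-ex',
        bash_command=(
            'hive --hiveconf DAY={{ ds }} --hiveconf YESTERDAY={{ yesterday_ds }} '
            '-f /sql/hive-ex.sql > {{ params.output_path }}'
        ),
        params={'output_path': '/tmp/hive-ex.csv'},
        dag=dag,
    )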

Airflow task after BranchPythonOperator does not fail and succeed correctly

Submitted by 我与影子孤独终老i on 2020-01-14 13:47:08
Question: In my DAG, I have some tasks that should only run on Saturdays. I therefore used a BranchPythonOperator to branch between the Saturday tasks and a dummy task. After that, I join both branches and want to run other tasks. The workflow looks like this (diagram omitted in this excerpt). Here I set the trigger rule for dummy3 to 'one_success' and everything works fine. The problem I encountered is when something upstream of the BranchPythonOperator fails: the BranchPythonOperator and the branches correctly have the state
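For reference, a hedged sketch of the branch-and-join layout the post describes (task names and the Saturday check are illustrative, and whether 'one_success' is the right rule for the join is exactly what the question is about; assumes an existing dag object, Airflow 1.10.x style):

    # Sketch of the branch/join pattern: Saturday-only branch vs. dummy branch, joined by dummy3
    from airflow.operators.python_operator import BranchPythonOperator
    from airflow.operators.dummy_operator import DummyOperator

    def choose_branch(**context):
        # run the Saturday-only task on Saturdays, otherwise skip straight to the dummy branch
        return 'saturday_task' if context['execution_date'].weekday() == 5 else 'dummy1'

    branch = BranchPythonOperator(task_id='branch', python_callable=choose_branch,
                                  provide_context=True, dag=dag)
    saturday_task = DummyOperator(task_id='saturday_task', dag=dag)
    dummy1 = DummyOperator(task_id='dummy1', dag=dag)
    dummy3 = DummyOperator(task_id='dummy3', trigger_rule='one_success', dag=dag)

    branch >> [saturday_task, dummy1]
    [saturday_task, dummy1] >> dummy3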

How to delete XCOM objects once the DAG finishes its run in Airflow

Submitted by 醉酒当歌 on 2020-01-14 08:11:11
Question: I have a huge JSON file in XCom that I no longer need once the DAG execution is finished, but I still see the XCom object in the UI with all the data. Is there any way to delete the XCom programmatically once the DAG run is finished? Thank you. Answer 1: You have to add a task, depending on your metadata DB (SQLite, PostgreSQL, MySQL, ...), that deletes the XCom entries once the DAG run is finished. delete_xcom_task = PostgresOperator( task_id='delete-xcom-task', postgres_conn_id='airflow_db', sql="delete from
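The quoted answer is cut off inside the SQL string. A hedged completion of that cleanup task (table and column names follow the default Airflow schema; the exact WHERE clause is an assumption, not from the post):

    # Sketch: cleanup task that deletes this DAG run's XCom rows from the metadata DB
    from airflow.operators.postgres_operator import PostgresOperator

    delete_xcom_task = PostgresOperator(
        task_id='delete-xcom-task',
        postgres_conn_id='airflow_db',
        sql="DELETE FROM xcom WHERE dag_id = '{{ dag.dag_id }}' AND execution_date = '{{ ts }}'",
        dag=dag,
    )
    # wire it in as the last task of the DAG, e.g. final_task >> delete_xcom_task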

For Apache Airflow, how can I pass parameters when manually triggering a DAG via the CLI?

Submitted by 爱⌒轻易说出口 on 2020-01-13 10:26:29
Question: I use Airflow to manage ETL task execution and scheduling. A DAG has been created and it works fine, but is it possible to pass parameters when manually triggering the DAG via the CLI? For example: my DAG runs every day at 01:30 and processes data for yesterday (the time range from 01:30 yesterday to 01:30 today). There might be issues with the data source, and then I need to re-process that data (manually specifying the time range). So can I create such an Airflow DAG, when it's scheduled, that the default
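A hedged sketch of the usual pattern (not quoted from the post): pass a JSON payload with airflow trigger_dag my_dag --conf '{"start": "...", "end": "..."}' and read it from dag_run.conf inside the task, falling back to the scheduled window when no override is supplied (key names are illustrative; assumes an existing dag object):

    # Sketch: read optional CLI-supplied parameters from dag_run.conf
    from airflow.operators.python_operator import PythonOperator

    def process(**context):
        dag_run = context.get('dag_run')
        conf = (dag_run.conf or {}) if dag_run else {}
        # fall back to the scheduled window when no manual override is given
        start = conf.get('start', context['execution_date'])
        end = conf.get('end', context['next_execution_date'])
        print('processing window', start, end)

    process_task = PythonOperator(task_id='process', python_callable=process,
                                  provide_context=True, dag=dag)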

Export environment variables at runtime with airflow

Submitted by 帅比萌擦擦* on 2020-01-13 09:38:26
Question: I am currently converting workflows that were previously implemented as bash scripts into Airflow DAGs. In the bash scripts I simply exported the variables at run time with export HADOOP_CONF_DIR="/etc/hadoop/conf". Now I'd like to do the same in Airflow, but haven't found a solution yet. The one workaround I found was setting the variables with os.environ[VAR_NAME]='some_text' outside of any method or operator, but that means they get exported the moment the script gets loaded, not at
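The excerpt is truncated, but as a hedged illustration of keeping the export at task run time rather than at DAG-parse time (the variable name comes from the post, everything else is illustrative; assumes an existing dag object): either set it inside the callable, or pass it per task via the BashOperator's env argument.

    # Sketch: export HADOOP_CONF_DIR only when the task actually runs
    import os
    from airflow.operators.python_operator import PythonOperator
    from airflow.operators.bash_operator import BashOperator

    def run_with_env(**context):
        os.environ['HADOOP_CONF_DIR'] = '/etc/hadoop/conf'   # set at run time, not at import time
        # ... call whatever needs the variable here ...

    py_task = PythonOperator(task_id='py_task', python_callable=run_with_env,
                             provide_context=True, dag=dag)

    # or hand the variable only to the bash command that needs it
    sh_task = BashOperator(task_id='sh_task', bash_command='echo $HADOOP_CONF_DIR',
                           env={'HADOOP_CONF_DIR': '/etc/hadoop/conf'}, dag=dag)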

Run .EXE and Powershell tasks with Airflow

Submitted by 不打扰是莪最后的温柔 on 2020-01-13 06:11:07
Question: Our systems are basically just Windows servers running C# and PowerShell applications in conjunction with MS SQL Server. We have an in-house workflow-management solution that can run tasks executing EXE/BAT/PS1 files and even call DLL functions. Now I am evaluating whether Apache Airflow would be a better solution for us. My naive plan so far is to run the Airflow scheduler on a Linux machine and let the consumers run on Windows machines. But how would I set up the consumer to run a .exe task for
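Not from the post, but one commonly suggested pattern for this setup is to keep Airflow on Linux and reach the Windows machines over SSH (a Windows OpenSSH server and an existing SSH connection are assumed; paths, connection id, and dag object are illustrative). A hedged sketch with the contrib SSHOperator:

    # Sketch: run an .exe or a PowerShell script on a remote Windows host via SSH
    from airflow.contrib.operators.ssh_operator import SSHOperator

    run_exe = SSHOperator(
        task_id='run_exe',
        ssh_conn_id='windows_host',          # SSH connection to the Windows server (assumed to exist)
        command='C:\\apps\\my_tool.exe --flag value',
        dag=dag,
    )

    run_ps = SSHOperator(
        task_id='run_ps',
        ssh_conn_id='windows_host',
        command='powershell.exe -File C:\\scripts\\job.ps1',
        dag=dag,
    )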

Airflow psycopg2.OperationalError: FATAL: sorry, too many clients already

Submitted by 无人久伴 on 2020-01-13 02:54:41
Question: I have a four-node clustered Airflow environment that has been working fine for me for a few months now.

EC2 instances:
Server 1: Webserver, Scheduler, Redis Queue, PostgreSQL Database
Server 2: Webserver
Server 3: Worker
Server 4: Worker

Recently I've been working on a more complex DAG with a few dozen tasks, compared to the relatively small ones I was working on beforehand. I'm not sure if that's why I'm only now seeing this error, but I'll sporadically get this error: On
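The error text is cut off, but since "FATAL: sorry, too many clients already" comes from PostgreSQL hitting its max_connections limit, a hedged place to start tuning (values purely illustrative, not from the post) is the SQLAlchemy pool settings in airflow.cfg on every node and the Postgres limit itself; a pooler such as PgBouncer in front of the database is another common mitigation.

    # airflow.cfg (on each webserver / scheduler / worker)
    [core]
    sql_alchemy_pool_size = 5        # DB connections kept open per Airflow process
    sql_alchemy_max_overflow = 10    # extra connections allowed under load
    parallelism = 32                 # cap on concurrently running task instances

    # postgresql.conf on Server 1
    max_connections = 200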