airflow

Loop over Airflow variables issue

回眸只為那壹抹淺笑 Submitted on 2019-12-11 16:47:31
Question: I am having a hard time looping over an Airflow Variable in my script. The requirement is to list all files prefixed by a string in a bucket, then loop through that list and do some operations on each file. I tried making use of XCom and SubDAGs but couldn't figure it out, so I came up with a new approach. It involves two scripts: the first script sets the Airflow Variable with a value I generate. Below is the code. #!/usr/bin/env python with DAG('Test_variable', default_args=default_args, schedule_interval
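
Below is a rough sketch of that two-script idea, assuming Airflow 1.x with the GCS contrib hook; the bucket name, prefix, Variable name and the per-file operation are placeholders, not values from the question.

# script 1: gather the file list and store it in an Airflow Variable
from datetime import datetime

from airflow import DAG
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator


def publish_file_list():
    hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
    files = hook.list('my-bucket', prefix='incoming/')  # object names under the prefix
    Variable.set('files_to_process', files or [], serialize_json=True)


# script 2: the DAG file reads the Variable at parse time and loops over it
default_args = {'start_date': datetime(2019, 1, 1)}

with DAG('Test_variable', default_args=default_args, schedule_interval=None) as dag:
    for i, path in enumerate(Variable.get('files_to_process',
                                          default_var=[], deserialize_json=True)):
        PythonOperator(
            task_id='process_file_{}'.format(i),
            python_callable=lambda p=path: print(p),  # placeholder operation
        )

The Variable is re-read every time the scheduler parses the file, so it should stay small; per-run data is usually better passed through XCom.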

airflow dag scheduler trigger execution_date

删除回忆录丶 Submitted on 2019-12-11 16:46:57
Question: ENV: https://github.com/puckel/docker-airflow, VERSION: 1.8.1-1, Executor: LocalExecutor. DAG settings: start_date: datetime(2018, 1, 8), schedule_interval: daily. Current time: [2018-01-11 06:23:00]. Why has DAG d3's run_id=[scheduled__2018-01-11T00:00:00] not been triggered at the current time [2018-01-11 06:23:00]? And at the current time [2018-01-11 06:23:00], is there any way to schedule-trigger d3's run_id=[scheduled__2018-01-11T00:00:00] rather than [scheduled__2018-01-10T00:00:00]?
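
The usual explanation, sketched below with a placeholder task, is Airflow's interval semantics: a run stamped with execution_date T covers the interval [T, T + schedule_interval) and is only created after that interval closes, so the 2018-01-11 run is not expected before 2018-01-12 00:00.

# Illustration only: same start_date/schedule as the question, trivial task body.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG('d3',
         start_date=datetime(2018, 1, 8),
         schedule_interval='@daily') as dag:
    DummyOperator(task_id='noop')

# At 2018-01-11 06:23 the newest scheduled run is execution_date 2018-01-10
# (the last fully elapsed day). To get a run stamped 2018-01-11 immediately you
# would have to trigger it manually, e.g. "airflow trigger_dag -e 2018-01-11 d3",
# accepting that it runs before its interval has closed.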

Creating dynamic tasks in airflow (in composer) based on bigquery response

怎甘沉沦 Submitted on 2019-12-11 16:25:01
Question: I am trying to create an Airflow DAG which generates tasks depending on the response from a server. Here is my approach: get a list of tables from BigQuery -> loop through the list and create tasks. This is my latest code, and I have tried every variant I could find on Stack Overflow; nothing seems to work. What am I doing wrong? with models.DAG(dag_id="xt", default_args=default_args, schedule_interval="0 1 * * *", catchup=True) as dag: tables = get_tables_from_bq() bridge = DummyOperator( task_id=
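
For reference, here is a hedged sketch of the pattern, assuming a Composer image with Airflow 1.10.x; the project/dataset, connection id and the per-table operator are placeholders, and the key constraint is that get_tables_from_bq() runs at DAG-parse time, so it must be fast and must not raise when the scheduler parses the file.

from datetime import datetime

from airflow import models
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.operators.dummy_operator import DummyOperator

default_args = {'start_date': datetime(2019, 1, 1)}


def get_tables_from_bq():
    # __TABLES__ is BigQuery's per-dataset metadata table; swap in your own project/dataset.
    hook = BigQueryHook(bigquery_conn_id='bigquery_default', use_legacy_sql=False)
    df = hook.get_pandas_df(
        'SELECT table_id FROM `my_project.my_dataset.__TABLES__`')
    return df['table_id'].tolist()


with models.DAG(dag_id='xt', default_args=default_args,
                schedule_interval='0 1 * * *', catchup=True) as dag:
    bridge = DummyOperator(task_id='bridge')
    for table in get_tables_from_bq():
        bridge >> DummyOperator(task_id='process_{}'.format(table))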

Configuring Google cloud bucket as Airflow Log folder

元气小坏坏 Submitted on 2019-12-11 16:13:54
Question: We just started using Apache Airflow in our project for our data pipelines. While exploring its features we learned about configuring a remote folder as the log destination in Airflow. For that we created a Google Cloud Storage bucket. From the Airflow UI I created a new GCS connection, but I am not able to understand all the fields. I just created a sample GCS bucket under my project from the Google console and gave that project ID to this connection, leaving the key file path and scopes blank. Then I edited the airflow.cfg file
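
For orientation, a minimal sketch of the settings that have to line up for GCS remote logging on Airflow 1.9+, written here as the environment-variable equivalents of the airflow.cfg [core] keys; the bucket and connection names are placeholders.

import os

# These must be in place before the scheduler/webserver processes start;
# putting the same keys in airflow.cfg is equivalent.
os.environ['AIRFLOW__CORE__REMOTE_LOGGING'] = 'True'
os.environ['AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER'] = 'gs://my-airflow-logs/logs'
os.environ['AIRFLOW__CORE__REMOTE_LOG_CONN_ID'] = 'my_gcs_conn'

# The "my_gcs_conn" connection (Conn Type: Google Cloud Platform) then needs a
# service-account keyfile path or keyfile JSON with write access to the bucket,
# or it can be left empty if the machine's default credentials already have
# access; the scopes field can usually stay blank.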

Airflow 1.10.2 not writing logs to S3

余生长醉 Submitted on 2019-12-11 16:03:21
Question: I'm trying to run Airflow in a Docker container and send the logs to S3. I have the following environment: Airflow version 1.10.2. I also updated the following in airflow.cfg: logging_config_class = log_config.LOGGING_CONFIG, where LOGGING_CONFIG is defined in config/log_config.py. I've created the following files: config/__init__.py and config/log_config.py. I've set up log_config.py in the following way: # -*- coding: utf-8 -*- # # Licensed to the Apache Software Foundation (ASF) under
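
For comparison, here is a minimal config/log_config.py sketch for 1.10.x along the usual "clone the default config and swap the task handler" lines; the bucket is a placeholder, and remote_log_conn_id (an AWS connection with write access) still has to be set in airflow.cfg.

import os
from copy import deepcopy

from airflow import configuration as conf
from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

S3_LOG_FOLDER = 's3://my-airflow-logs/logs'   # placeholder

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# Replace the default file-based task handler so finished task logs are
# uploaded to S3 and read back from S3 by the webserver.
LOGGING_CONFIG['handlers']['task'] = {
    'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
    'formatter': 'airflow',
    'base_log_folder': os.path.expanduser(conf.get('core', 'BASE_LOG_FOLDER')),
    's3_log_folder': S3_LOG_FOLDER,
    'filename_template': conf.get('core', 'LOG_FILENAME_TEMPLATE'),
}

# airflow.cfg must also have logging_config_class = log_config.LOGGING_CONFIG and
# the config/ directory must be importable inside the container. On 1.10.x,
# setting remote_logging / remote_base_log_folder / remote_log_conn_id alone may
# already be enough without a custom class.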

In Apache Airflow, DAG won't run due to duplicate entry problem in task_instance table

喜你入骨 Submitted on 2019-12-11 15:52:16
Question: All day today I have been getting this error in the Airflow scheduler: sqlalchemy.exc.IntegrityError: (_mysql_exceptions.IntegrityError) (1062, "Duplicate entry '%' for key 'PRIMARY'"). Because of this the Airflow scheduler would stop, and every time I reran it I hit the same problem. Answer 1: This is due to MySQL's ON UPDATE CURRENT_TIMESTAMP, and it is tracked in Airflow's JIRA: https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-3045?filter=allopenissues I fixed this by altering
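
The answer is cut off before the exact ALTER statement, so rather than guess at it, the sketch below only checks the MySQL setting Airflow's documentation requires for exactly this class of problem (explicit_defaults_for_timestamp); the connection string is a placeholder.

from sqlalchemy import create_engine, text

engine = create_engine('mysql+mysqldb://airflow:airflow@localhost/airflow')  # placeholder

with engine.connect() as conn:
    row = conn.execute(
        text("SHOW VARIABLES LIKE 'explicit_defaults_for_timestamp'")).fetchone()
    # Expect ('explicit_defaults_for_timestamp', 'ON'); when it is OFF, MySQL
    # silently adds ON UPDATE CURRENT_TIMESTAMP to timestamp columns, including
    # ones in task_instance's primary key, which produces duplicate-key errors.
    print(row)
    # After switching it on in my.cnf, affected task_instance timestamp columns
    # may still need an ALTER TABLE ... MODIFY ... to drop the implicit
    # ON UPDATE CURRENT_TIMESTAMP they were created with.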

S3 Delete & HDFS to S3 Copy

佐手、 Submitted on 2019-12-11 15:43:09
Question: As part of my Spark pipeline, I have to perform the following tasks on EMR / S3: Delete: (recursively) delete all files/directories under a given S3 bucket; Copy: copy the contents of a directory (subdirectories and files) to a given S3 bucket. Based on my current knowledge, Airflow doesn't provide operators/hooks for these tasks, so I plan to implement them as follows: Delete: extend S3Hook to add a function that performs aws s3 rm on the specified S3 bucket; Copy: use SSHExecuteOperator
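
A rough sketch of the "extend S3Hook" part, assuming Airflow 1.10.x where S3Hook wraps boto3; bucket names and connection handling are left to the caller, and the copy step (running s3-dist-cp on the EMR master over SSH) is only noted in a comment.

from airflow.hooks.S3_hook import S3Hook


class S3DeleteHook(S3Hook):
    """S3Hook plus a recursive delete, roughly `aws s3 rm --recursive`."""

    def delete_prefix(self, bucket_name, prefix=''):
        keys = self.list_keys(bucket_name=bucket_name, prefix=prefix) or []
        client = self.get_conn()  # boto3 S3 client
        # S3's DeleteObjects API accepts at most 1000 keys per request.
        for i in range(0, len(keys), 1000):
            chunk = keys[i:i + 1000]
            client.delete_objects(
                Bucket=bucket_name,
                Delete={'Objects': [{'Key': k} for k in chunk]},
            )

# For the copy step, one option is an SSH operator that runs
# "s3-dist-cp --src hdfs:///path --dest s3://bucket/path" on the EMR master,
# which is what the question's SSHExecuteOperator plan amounts to.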

SQL Server Hook and operator connection in Airflow

落爺英雄遲暮 Submitted on 2019-12-11 15:26:40
Question: I am new to Airflow and I need to use MssqlHook or MssqlOperator, but I do not know how. I am using the hook and operator with the code below: hook = MsSqlHook(mssql_conn_id=ms_sql) t2 = MsSqlOperator( task_id='sql-op', mssql_conn_id=ms_sql, sql='Select Current_date()', dag=dag) In the Airflow UI the connection is set up as: Conn Id: ms_sql, Conn Type: Microsoft SQL Server, Host: host_name, Schema: default, Login: username, Password: password, Port: 14481. And when I do this the error is "Connection failed"
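
For what it's worth, a minimal self-contained sketch of both routes (operator and hook); the connection id 'ms_sql' and the query intent are taken from the question, everything else is a placeholder. Note the conn id is passed as a string, SQL Server has no CURRENT_DATE() (GETDATE() is the usual equivalent), and MsSqlHook/MsSqlOperator need the pymssql package installed.

from datetime import datetime

from airflow import DAG
from airflow.hooks.mssql_hook import MsSqlHook
from airflow.operators.mssql_operator import MsSqlOperator

default_args = {'start_date': datetime(2019, 1, 1)}

with DAG('mssql_example', default_args=default_args, schedule_interval=None) as dag:
    t2 = MsSqlOperator(
        task_id='sql-op',
        mssql_conn_id='ms_sql',   # string, matching the Conn Id set in the UI
        sql='SELECT GETDATE();',  # SQL Server equivalent of "current date/time"
    )


def read_rows():
    # Hook route: returns the result set as a list of tuples.
    hook = MsSqlHook(mssql_conn_id='ms_sql')
    return hook.get_records('SELECT GETDATE();')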

Airflow list dag times out exactly after 30 seconds

℡╲_俬逩灬. Submitted on 2019-12-11 15:15:50
Question: I have a dynamic Airflow DAG (backfill_dag) that basically reads an admin Variable (JSON) and builds itself. backfill_dag is used for backfilling/history loading. For example, if I want to history-load DAGs x, y and z in some order (x and y run in parallel, z depends on x), then I describe this in a particular JSON format and put it in the admin Variable of backfill_dag. backfill_dag then parses the JSON, renders the tasks of the DAGs x, y, and z, and builds itself dynamically, with x and y in
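
A hedged sketch of that pattern is below (the Variable name, JSON shape and operators are placeholders). It also illustrates the usual cause of an exact 30-second timeout: everything at module level runs every time the file is parsed, and parsing is cut off after [core] dagbag_import_timeout seconds, which defaults to 30; keeping parse-time work down to one cheap Variable lookup, or raising that setting, is the typical remedy.

import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator

default_args = {'start_date': datetime(2019, 1, 1)}

# One metadata-DB query at parse time; anything slower risks the 30 s cutoff.
config = json.loads(Variable.get('backfill_config', default_var='{}'))

with DAG('backfill_dag', default_args=default_args, schedule_interval=None) as dag:
    tasks = {}
    for name in config.get('dags', []):                          # e.g. ["x", "y", "z"]
        tasks[name] = DummyOperator(task_id='load_{}'.format(name))
    for child, parent in config.get('depends_on', {}).items():   # e.g. {"z": "x"}
        tasks[parent] >> tasks[child]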

Passing typesafe config conf files to DataProcSparkOperator

℡╲_俬逩灬. Submitted on 2019-12-11 12:42:37
Question: I am using Google Dataproc to submit Spark jobs and Google Cloud Composer to schedule them. Unfortunately, I am facing difficulties. I rely on .conf files (Typesafe config files) to pass arguments to my Spark jobs. I am using the following Python code for the Airflow Dataproc task: t3 = dataproc_operator.DataProcSparkOperator( task_id='execute_spark_job_cluster_test', dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar', cluster_name='cluster', main_class = 'com
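
One commonly suggested workaround, sketched below rather than asserted as the answer: stage the .conf file with the job via files= and point Typesafe Config at it with -Dconfig.file. The main class, bucket paths and the driver/executor property split are assumptions to adapt (the question's main_class value is truncated).

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators import dataproc_operator

default_args = {'start_date': datetime(2019, 1, 1)}

with DAG('spark_conf_example', default_args=default_args, schedule_interval=None) as dag:
    t3 = dataproc_operator.DataProcSparkOperator(
        task_id='execute_spark_job_cluster_test',
        dataproc_spark_jars=['gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar'],
        cluster_name='cluster',
        main_class='com.example.Main',                    # placeholder
        files=['gs://my-bucket/conf/application.conf'],   # copied to the job's working dir
        dataproc_spark_properties={
            'spark.driver.extraJavaOptions': '-Dconfig.file=application.conf',
            'spark.executor.extraJavaOptions': '-Dconfig.file=application.conf',
        },
    )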