apache-airflow

How do I set an environment variable for airflow to use?

Question: Airflow returns an error when trying to run a DAG, saying that it can't find an environment variable, which is odd because it can find the 3 other environment variables that I'm storing as Python variables. No issues with those variables at all. I have all 4 variables in ~/.profile and have also done

    export var1="variable1"
    export var2="variable2"
    export var3="variable3"
    export var4="variable4"

Under what user does airflow run? I've done those export commands under sudo as well, so
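A point worth checking: Airflow tasks only see the environment of the process that runs the scheduler/worker, which is often a service or a different user, so variables exported in an interactive ~/.profile may simply not be present there. Below is a minimal sketch, assuming Airflow 1.x import paths, that prints what the worker actually sees; the DAG id and variable names are illustrative:

    import os
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path

    def print_env_vars(**kwargs):
        # os.environ here reflects the environment of the *worker* process,
        # not the interactive shell where ~/.profile was sourced.
        for name in ("var1", "var2", "var3", "var4"):
            print("%s=%s" % (name, os.environ.get(name, "<NOT SET>")))

    dag = DAG("env_var_check", start_date=datetime(2019, 1, 1), schedule_interval=None)

    check_env = PythonOperator(
        task_id="check_env",
        python_callable=print_env_vars,
        dag=dag,
    )

If the variables come back as not set, exporting them in the environment that starts the scheduler and workers (for example in the service definition) is usually the fix.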

Airflow DAG success callback

Question: Is there an elegant way to define a callback for a DAG success event? I really don't want to set up a task with on_success_callback that is upstream of all other tasks. Thanks! Answer 1: So if I understand correctly, the last step of your DAG is, in case of success, to call back to some other system. So I would encourage you to model your DAG exactly that way. Why would you try to hide that part from the logic of your DAG? That's exactly what the up/downstream modeling is for. Hiding part of the DAG
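A minimal sketch of the approach the answer describes, assuming Airflow 1.x: the notification is modeled as the final task, which only runs once every upstream task has succeeded. The DAG id, task ids and callable are illustrative:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import PythonOperator

    def notify_success(**kwargs):
        # Call back to the external system here (HTTP request, message queue, etc.).
        print("DAG run %s finished successfully" % kwargs["run_id"])

    dag = DAG("callback_example", start_date=datetime(2019, 1, 1), schedule_interval=None)

    work_a = DummyOperator(task_id="work_a", dag=dag)
    work_b = DummyOperator(task_id="work_b", dag=dag)

    # Final task: runs only when all upstream tasks succeed, so it effectively
    # acts as the "DAG succeeded" callback.
    notify = PythonOperator(
        task_id="notify_success",
        python_callable=notify_success,
        provide_context=True,
        dag=dag,
    )

    notify.set_upstream([work_a, work_b])

Depending on your Airflow version, the DAG object may also accept an on_success_callback argument directly, which is worth checking as an alternative.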

Status of Airflow task within the dag

Question: I need the status of a task, i.e. whether it is running, up_for_retry or failed, within the same DAG. So I tried to get it using the code below, though I got no output...

    Auto = PythonOperator(
        task_id='test_sleep',
        python_callable=execute_on_emr,
        op_kwargs={'cmd': 'python /home/hadoop/test/testsleep.py'},
        dag=dag)

    logger.info(Auto)

The intention is to kill certain running tasks once a particular task on airflow completes. The question is how do I get the state of a task, e.g. whether it is in the running state
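One way to read task states from inside the same DAG run, sketched below under the assumption of Airflow 1.x with provide_context=True: the DagRun object in the task context exposes every TaskInstance and its current state. The DAG id and task id are illustrative, and the actual "kill" logic is not shown:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def report_task_states(**context):
        # The DagRun for the current execution exposes all TaskInstances,
        # each of which carries its state (running, up_for_retry, failed, ...).
        dag_run = context["dag_run"]
        for ti in dag_run.get_task_instances():
            print("task %s is in state %s" % (ti.task_id, ti.state))

    dag = DAG("task_state_example", start_date=datetime(2019, 1, 1), schedule_interval=None)

    check_states = PythonOperator(
        task_id="check_states",
        python_callable=report_task_states,
        provide_context=True,
        dag=dag,
    )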

Want to create airflow tasks that are downstream of the current task

I'm mostly brand new to airflow. I have a two-step process:

1. Get all files that match a criteria
2. Uncompress the files

The files are half a gig compressed, and 2 - 3 gig when uncompressed. I can easily have 20+ files to process at a time, which means uncompressing all of them can run longer than just about any reasonable timeout. I could use XCom to get the results of step 1, but what I'd like to do is something like this:

    def processFiles(reqDir, gvcfDir, matchSuffix):
        theFiles = getFiles(reqDir, gvcfDir, matchSuffix)
        for filePair in theFiles:
            task = PythonOperator(task_id="Uncompress_" +
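Airflow 1.x does not let a running task add new tasks downstream of itself at runtime; a common workaround is to generate one uncompress task per file when the DAG file is parsed, which only works if the file list can be computed at parse time. A hedged sketch along those lines, where the glob pattern stands in for getFiles(reqDir, gvcfDir, matchSuffix) and the DAG id is illustrative:

    import glob
    import os
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def uncompress(file_path, **kwargs):
        # Placeholder: the real implementation would decompress file_path here.
        print("uncompressing %s" % file_path)

    dag = DAG("uncompress_files", start_date=datetime(2019, 1, 1), schedule_interval=None)

    # The file list must be computable when the scheduler parses this file;
    # each matching file becomes its own task, so the tasks run in parallel
    # and no single task hits the long-timeout problem.
    for file_path in glob.glob("/data/requests/*.gz"):
        PythonOperator(
            task_id="Uncompress_" + os.path.basename(file_path).replace(".", "_"),
            python_callable=uncompress,
            op_kwargs={"file_path": file_path},
            dag=dag,
        )

If the file list is only known at runtime, splitting the work into a second DAG triggered per file is another pattern people use.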

Airflow Remote logging not working

Question: I have an up-and-running Apache Airflow 1.8.1 instance. I have a working connection (and its ID) to write to Google Cloud Storage, and my airflow user has permission to write to the bucket. I am trying to use the remote log storage functionality by adding

    remote_base_log_folder = 'gs://my-bucket/log'
    remote_log_conn_id = 'my_working_conn_id'

And that's all (I didn't touch any other configuration). I restarted all the services, but the logs aren't uploading to GCS (my bucket is still empty)
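For reference, the shape of the relevant settings in later 1.x releases, where remote logging also has to be switched on explicitly; the exact keys and sections vary between 1.8, 1.9 and 1.10 (some versions additionally require a custom logging config class), so this is only a sketch to compare against the docs for your release:

    [core]
    remote_logging = True
    remote_base_log_folder = gs://my-bucket/log
    remote_log_conn_id = my_working_conn_id

On 1.8.x specifically, upgrading to a newer 1.x release is often the most reliable way to get GCS log upload working.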

How to get the JobID for the airflow dag runs?

Question: When we do a dag run, in the "Graph View" of the Airflow UI we get details of each job run. The JobID is something like "scheduled__2017-04-11T10:47:00". I need this JobID for tracking and log creation, in which I record the time each task/dag run took. So my question is: how can I get the JobID within the same dag that is being run? Thanks, Chetan Answer 1: This value is actually called run_id and can be accessed via the context or macros. In the python operator this is accessed via context, and in the
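A short sketch of both access paths the answer mentions, assuming Airflow 1.x; the DAG id and task ids are illustrative:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator

    def log_run_id(**context):
        # run_id is the value shown in the UI, e.g. "scheduled__2017-04-11T10:47:00".
        print("run_id from context: %s" % context["run_id"])

    dag = DAG("run_id_example", start_date=datetime(2019, 1, 1), schedule_interval=None)

    # Python operator: run_id comes in through the task context.
    python_way = PythonOperator(
        task_id="run_id_from_context",
        python_callable=log_run_id,
        provide_context=True,
        dag=dag,
    )

    # Templated operators: run_id is available as a Jinja macro.
    bash_way = BashOperator(
        task_id="run_id_from_macro",
        bash_command='echo "run_id is {{ run_id }}"',
        dag=dag,
    )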

Airflow will keep showing example dags even after removing it from configuration

Question: Airflow example dags remain in the UI even after I have set load_examples = False in the config file. The system informs me the dags are not present in the dag folder, but they remain in the UI because the scheduler has marked them as active in the metadata database. I know one way to remove them from there would be to directly delete these rows in the database, but of course this is not ideal. How should I proceed to remove these dags from the UI? Answer 1: There is currently no way of stopping a deleted DAG
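If you do end up cleaning the metadata database, Airflow 1.10+ also ships an airflow delete_dag CLI command that is usually preferable. Purely as a hedged sketch of the direct approach (back up the database first; the example-DAG ids here are assumed to start with "example_", which does not cover every bundled example such as "tutorial", and runs of those DAGs may leave rows in other tables as well):

    from airflow import settings
    from airflow.models import DagModel

    # Removes the UI entries for the bundled example DAGs from the metadata DB.
    # This edits Airflow's internal state directly; take a backup first.
    session = settings.Session()
    session.query(DagModel).filter(DagModel.dag_id.like("example_%")).delete(
        synchronize_session=False
    )
    session.commit()
    session.close()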

How to wait for an asynchronous event in a task of a DAG in a workflow implemented using Airflow?

Question: My workflow implemented using Airflow contains tasks A, B, C, and D. I want the workflow to wait at task C for an event. In Airflow, sensors are used to check for some condition by polling for some state; if that condition is true, then the next task in the workflow gets triggered. My requirement is to avoid polling. Here, one answer mentions a rest_api_plugin for airflow which creates a rest_api endpoint to trigger the airflow CLI - using this plugin I can trigger a task in the workflow. In my
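One polling-free pattern, sketched below under the assumption of Airflow 1.x, is to split the workflow at the waiting point into two DAGs and let the external system start the second half when the event occurs (for example via the airflow trigger_dag CLI command, or a REST plugin that wraps it). The DAG and task ids are illustrative:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # First DAG: runs A and B, then stops. Nothing polls for the event.
    upstream = DAG("workflow_part1", start_date=datetime(2019, 1, 1), schedule_interval=None)
    task_a = DummyOperator(task_id="A", dag=upstream)
    task_b = DummyOperator(task_id="B", dag=upstream)
    task_a >> task_b

    # Second DAG: contains C and D, and is only started when the external
    # system signals the event, e.g. by running:
    #     airflow trigger_dag workflow_part2
    downstream = DAG("workflow_part2", start_date=datetime(2019, 1, 1), schedule_interval=None)
    task_c = DummyOperator(task_id="C", dag=downstream)
    task_d = DummyOperator(task_id="D", dag=downstream)
    task_c >> task_d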

Copy files from one Google Cloud Storage Bucket to other using Apache Airflow

Problem: I want to copy files from a folder in a Google Cloud Storage bucket (e.g. Folder1 in Bucket1) to another bucket (e.g. Bucket2). I can't find any Airflow Operator for Google Cloud Storage to copy files.

Answer: I know this is an old question, but I found myself dealing with this task too. Since I'm using Google Cloud Composer, GoogleCloudStorageToGoogleCloudStorageOperator was not available in the current version. I managed to solve this issue by using a simple BashOperator:

    from airflow.operators.bash_operator import BashOperator

    with models.DAG(
            dag_name,
            schedule_interval=timedelta(days=1),
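The answer's snippet is cut off above; a hedged completion of that BashOperator idea is sketched below. It assumes gsutil is available on the workers (as it is on Cloud Composer); the DAG id, task id, and exact gsutil flags are illustrative rather than the original answer's command, and the bucket names come from the question:

    from datetime import datetime, timedelta

    from airflow import models
    from airflow.operators.bash_operator import BashOperator

    with models.DAG(
            "gcs_copy_example",
            schedule_interval=timedelta(days=1),
            start_date=datetime(2019, 1, 1)) as dag:

        # gsutil -m parallelises the copy; -r recurses into the folder.
        copy_files = BashOperator(
            task_id="copy_folder1_to_bucket2",
            bash_command="gsutil -m cp -r gs://Bucket1/Folder1 gs://Bucket2/",
        )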

Apache Airflow unable to establish a connection to a remote host via FTP/SFTP

Question: I am new to Apache Airflow and, so far, I have been able to work my way through problems I have encountered. I have hit a wall now. I need to transfer files to a remote server via SFTP, and I have not had any luck doing this. So far, I have gotten S3 and Postgres/Redshift connections, via their respective hooks, to work in various DAGs. I have been able to use the FTPHook successfully when testing against my local FTP server, but have not been able to figure out how to use SFTP to connect to a remote host. I
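For uploads over SFTP, later 1.x releases include a contrib SFTPOperator built on top of the SSH hook; availability depends on your Airflow version. A minimal sketch, assuming an SSH-type connection named my_sftp_server has been defined in Admin > Connections and that the file paths are illustrative:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.sftp_operator import SFTPOperator

    dag = DAG("sftp_upload_example", start_date=datetime(2019, 1, 1), schedule_interval=None)

    upload_file = SFTPOperator(
        task_id="upload_file",
        ssh_conn_id="my_sftp_server",        # SSH connection defined in Admin > Connections
        local_filepath="/tmp/outgoing/report.csv",
        remote_filepath="/incoming/report.csv",
        operation="put",                      # "put" uploads, "get" downloads
        dag=dag,
    )

If SFTPOperator is not available in your version, the underlying SSHHook (which uses key- or password-based SSH credentials from the connection) can be used from a PythonOperator instead.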