airflow

Can you get a static external IP address for Google Cloud Composer / Airflow?

Submitted by 不打扰是莪最后的温柔 on 2019-12-08 03:20:38
Question: I know how to assign a static external IP address to a Compute Engine instance, but can this be done with Google Cloud Composer (Airflow)? I'd imagine most companies need this functionality, since they'd generally be writing back to a warehouse that may sit behind a firewall, but I can't find any docs on how to do it.
Answer 1: It's not possible to assign a static IP to the underlying GKE cluster in a Composer environment. The endpoint @kaxil mentioned is the Kubernetes master endpoint, but not the…

Spark job submission using Airflow by submitting batch POST method on Livy and tracking job

Submitted by 女生的网名这么多〃 on 2019-12-08 01:34:36
Question: I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts and Spark jobs. For the Spark jobs in particular, I want to use Apache Livy, but I'm not sure whether that is a better idea than running spark-submit directly. And what is the best way to track a Spark job with Airflow once it has been submitted?
Answer 1: My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated…
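Below is a minimal sketch of the Livy batch flow the answer refers to: POST the application to Livy's /batches endpoint, then poll /batches/{id}/state until a terminal state is reached. The Livy URL, JAR path and class name are placeholders rather than values from the question, and in practice the polling loop would sit inside a PythonOperator callable or a custom operator.

    import time
    import requests

    LIVY_URL = "http://livy-host:8998"  # assumed Livy endpoint

    def submit_and_track_spark_job():
        # POST /batches asks Livy to run spark-submit on the cluster side
        payload = {
            "file": "hdfs:///apps/my-spark-app.jar",  # hypothetical JAR location
            "className": "com.example.MySparkJob",    # hypothetical main class
        }
        batch_id = requests.post(LIVY_URL + "/batches", json=payload).json()["id"]

        # Poll GET /batches/{id}/state until Livy reports a terminal state
        while True:
            state = requests.get("{}/batches/{}/state".format(LIVY_URL, batch_id)).json()["state"]
            if state in ("success", "dead", "killed"):
                break
            time.sleep(30)

        if state != "success":
            raise RuntimeError("Livy batch {} finished in state {}".format(batch_id, state))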

Airflow Audit Logs

Submitted by 江枫思渺然 on 2019-12-08 00:10:31
Question: I'm wondering what Airflow offers in the way of audit logs. My Airflow environment runs Airflow version 1.10 and uses the [ldap] section of the airflow.cfg file to authenticate against my company's Active Directory (AD). I can see that when someone logs into Airflow through the web UI, their username is written to the webserver's log (shown below). I'm wondering, though, whether Airflow can be modified to also log when a user turns a DAG on or off, creates a new Airflow Variable or Pool, clears a…
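Although the answer is cut off here, Airflow's metadata database does include a log table (exposed as the Log model) into which the web views record actions such as pausing a DAG or triggering a run. The sketch below, assuming Airflow 1.10 import paths, shows one way to pull recent entries from it.

    from airflow.models import Log
    from airflow.utils.db import create_session

    def recent_audit_events(limit=20):
        """Return the newest rows from the metadata 'log' table."""
        with create_session() as session:
            rows = (
                session.query(Log)
                .order_by(Log.dttm.desc())
                .limit(limit)
                .all()
            )
            # Each row carries the event name, the DAG it touched and the acting user
            return [(row.dttm, row.event, row.dag_id, row.owner) for row in rows]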

Test an Apache Airflow DAG while it is already scheduled and running?

Submitted by 安稳与你 on 2019-12-07 22:19:19
Question: I ran the following test command:

    airflow test events {task_name_redacted} 2018-12-12

...and got the following output:

    Dependencies not met for <TaskInstance: events.{redacted} 2018-12-12T00:00:00+00:00 [None]>, dependency 'Task Instance Slots Available' FAILED: The maximum number of running tasks (16) for this task's DAG 'events' has been reached.
    [2019-01-17 19:47:48,978] {models.py:1556} WARNING - --------------------------------------------------------------------------------
    FIXME: …
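The message comes from the DAG-level cap on concurrently running task instances (16 is the default dag_concurrency). A hedged sketch of where that cap can be raised on the DAG itself follows; only the dag_id "events" is taken from the question, the other values are illustrative.

    from datetime import datetime
    from airflow import DAG

    dag = DAG(
        dag_id="events",              # DAG name from the question
        start_date=datetime(2018, 12, 1),
        schedule_interval="@daily",
        concurrency=32,               # task instances this DAG may run at once
        max_active_runs=1,            # DAG runs that may be active in parallel
    )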

Cannot modify mapred.job.name at runtime. It is not in list of params that are allowed to be modified at runtime

Submitted by ↘锁芯ラ on 2019-12-07 18:46:32
Question: I am trying to run a Hive job in Airflow. I made a custom JDBC connection, which you can see in the image. I can query Hive tables through the Airflow web UI (Data Profiling -> Ad Hoc Query). I also want to run a sample DAG file from the Internet:

    #File Name: wf_incremental_load.py
    from airflow import DAG
    from airflow.operators import BashOperator, HiveOperator
    from datetime import datetime, timedelta

    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2019, 3, 13),
        'retries': 1,
        'retry…
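For reference, here is a hedged reconstruction of the kind of DAG the truncated snippet sketches, using Airflow 1.10 module-path imports and a HiveOperator bound to a named Hive CLI connection. The connection id, schedule and HQL statement are assumptions; the "Cannot modify mapred.job.name at runtime" error itself typically points at HiveServer2's parameter whitelist (hive.security.authorization.sqlstd.confwhitelist) rather than anything inside the DAG file.

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.hive_operator import HiveOperator

    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2019, 3, 13),
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG('wf_incremental_load', default_args=default_args, schedule_interval='@daily')

    hive_load = HiveOperator(
        task_id='hive_incremental_load',
        hql="SELECT COUNT(*) FROM my_db.my_table",  # placeholder query
        hive_cli_conn_id='hive_cli_default',        # assumed Hive CLI connection id
        dag=dag,
    )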

Airflow stops following Spark job submitted over SSH

Submitted by 六眼飞鱼酱① on 2019-12-07 16:15:04
Question: I'm using standalone Apache Airflow to submit my Spark jobs with SSHExecutorOperator, connecting to the edge node and submitting the jobs with a simple BashCommand. It mostly works well, but sometimes random tasks keep running indefinitely. The job itself succeeds, but according to Airflow it is still running. When I check the logs, it looks as if Airflow stopped following the job and never received the return value. Why could this happen? Some jobs run for 10h+ and Airflow watches them successfully,…
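For comparison, here is a hedged sketch using the contrib SSHOperator shipped with Airflow 1.10 (rather than the older SSHExecutorOperator from the question) to run spark-submit on the edge node. The connection id, application path and deploy mode are assumptions; in cluster deploy mode the Spark driver runs on the cluster rather than inside the SSH session, which makes the submission less sensitive to the SSH connection being dropped.

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.ssh_operator import SSHOperator

    dag = DAG('spark_over_ssh', start_date=datetime(2019, 12, 1), schedule_interval=None)

    submit = SSHOperator(
        task_id='spark_submit_on_edge_node',
        ssh_conn_id='edge_node_ssh',          # assumed SSH connection defined in Airflow
        command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/opt/jobs/my_job.py"             # placeholder application
        ),
        dag=dag,
    )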

ImportError : cannot import DAG airflow

Submitted by 故事扮演 on 2019-12-07 16:12:19
Question: I have simple code in which I am trying to import DAG from airflow:

    from airflow import DAG
    from airflow.operators import BashOperator, S3KeySensor
    from datetime import datetime, timedelta
    import psycopg2
    from datetime import date, timedelta

    yesterday = date.today() - timedelta(1)
    yesterdayDate = yesterday.strftime('%Y-%m-%d')

But I am getting an ImportError:

    Traceback (most recent call last):
      File "airflow.py", line 9, in <module>
        from airflow import DAG
      File "/home/ubuntu/airflow/dags/airflow.py", line…
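The traceback itself points at the likely cause: the DAG file is named airflow.py (/home/ubuntu/airflow/dags/airflow.py), so `from airflow import DAG` resolves back to the DAG file instead of the installed airflow package. A minimal sketch of the same imports in a file whose name doesn't shadow the package is below; the S3KeySensor import path assumes Airflow 1.10.

    # my_s3_dag.py -- any filename other than airflow.py avoids shadowing the package
    from datetime import date, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.sensors.s3_key_sensor import S3KeySensor  # 1.10 module path

    yesterday = date.today() - timedelta(1)
    yesterdayDate = yesterday.strftime('%Y-%m-%d')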

Airflow : DAG marked as “success” if one task fails, because of trigger rule ALL_DONE

Submitted by 两盒软妹~` on 2019-12-07 14:26:57
Question: I have the following DAG with 3 tasks:

    start --> special_task --> end

The task in the middle can succeed or fail, but end must always be executed (imagine it is a task for cleanly closing resources). For that, I used the trigger rule ALL_DONE:

    end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE

With that, end is properly executed if special_task fails. However, since end is the last task and succeeds, the DAG is always marked as SUCCESS. How can I configure my DAG so that if one of the…
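One commonly suggested pattern (a hedged sketch, not necessarily the accepted answer) is to keep the cleanup task at ALL_DONE and add a second leaf task downstream of special_task with the default ALL_SUCCESS rule: when special_task fails, that extra leaf ends up upstream_failed, and a failed leaf is what marks the DAG run as failed.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.utils.trigger_rule import TriggerRule

    dag = DAG('cleanup_example', start_date=datetime(2019, 12, 1), schedule_interval=None)

    start = DummyOperator(task_id='start', dag=dag)
    special_task = DummyOperator(task_id='special_task', dag=dag)

    # Always runs, even when special_task fails (resource cleanup).
    end = DummyOperator(task_id='end', trigger_rule=TriggerRule.ALL_DONE, dag=dag)

    # Extra leaf with the default ALL_SUCCESS rule: becomes upstream_failed
    # when special_task fails, which propagates failure to the DAG run state.
    final_status = DummyOperator(task_id='final_status', dag=dag)

    start >> special_task >> [end, final_status]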

Airflow 1.10.3 - Blank “Recent Tasks” and “DAG Runs”

Submitted by 放肆的年华 on 2019-12-07 11:48:15
Question: I installed Airflow 1.10.3 on Ubuntu 18.10 and am able to add my DAGs and run them, but "Recent Tasks" and "DAG Runs" in the web UI are blank. All I see is a black dotted circle that keeps loading, but nothing ever materializes. I recently switched my Airflow database to MySQL to see if that would fix it, but everything is still the same. Is this a configuration issue in airflow.cfg, or something else?
Answer 1: Apparently the DAG name can break the HTML document variable querySelector for "Recent Tasks"…
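Since the answer points at the dag_id breaking the page's querySelector, a hedged illustration: keep the dag_id to letters, digits and underscores. Dots are the usual offender reported for this symptom, because a selector like "#my.dag" is parsed as an id plus CSS class selectors.

    from datetime import datetime
    from airflow import DAG

    dag = DAG(
        dag_id="my_etl_daily",          # rather than e.g. "my.etl.daily"
        start_date=datetime(2019, 12, 1),
        schedule_interval="@daily",
    )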

How to really create n tasks in a SubDAG based on the result of a previous task

Submitted by 自古美人都是妖i on 2019-12-07 05:07:53
Question: I'm creating a dynamic DAG in Airflow using SubDAGs. What I need is for the number of tasks inside the SubDAG to be determined by the result of a previous task (the subtask_ids variable of the middle_section function should be the same variable as in the initial_task function). The problem is that I can't access XCom inside the subdag function of a SubDagOperator because I don't have any context. Also, I can't reach any database to read some value because of the DAG autodiscovery feature of the…
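Since XCom isn't available while the DAG file is being parsed, one frequently suggested workaround (a hedged sketch; the Variable key and task names are illustrative, loosely following the question's middle_section/subtask_ids naming) is to have the initial task persist the computed ids in an Airflow Variable and let the SubDAG factory read them back at parse time.

    import json
    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.dummy_operator import DummyOperator

    def middle_section_subdag(parent_dag_id, child_dag_id, default_args):
        subdag = DAG(
            dag_id="{}.{}".format(parent_dag_id, child_dag_id),
            default_args=default_args,
            schedule_interval=None,
        )
        # Written earlier by initial_task, e.g. Variable.set("subtask_ids", json.dumps(ids))
        subtask_ids = json.loads(Variable.get("subtask_ids", default_var="[]"))
        for subtask_id in subtask_ids:
            DummyOperator(task_id="work_{}".format(subtask_id), dag=subdag)
        return subdag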