airflow

Airflow user creation

怎甘沉沦 submitted on 2019-12-04 12:35:29
Question: I am using Airflow version 1.8.2 and have set up a couple of DAGs. Everything is running as expected. I have an admin user created for Airflow web server access, but we can't hand this admin user to other teams that need to monitor their jobs, so I tried to create a different user from the UI at '/admin/user/'. Only the following fields are available; there are no options to assign roles or a password. Has anyone faced the same issue, or am I doing something wrong? How do I create role-based users so that I can tag some
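As a hedged aside for this excerpt: in Airflow 1.8.x the usual way to add a password-protected web user is a short Python shell session on the Airflow host, assuming the password_auth backend is enabled in airflow.cfg; the username, email, and password below are placeholders.

# Requires in airflow.cfg: [webserver] authenticate = True and
# auth_backend = airflow.contrib.auth.backends.password_auth
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser

user = PasswordUser(models.User())
user.username = 'team_monitor'      # placeholder username
user.email = 'team@example.com'     # placeholder email
user.password = 'change-me'         # placeholder password; hashed by PasswordUser
session = settings.Session()
session.add(user)
session.commit()
session.close()

Note that 1.8.x has no per-user roles in the UI; role-based access control only arrived with the RBAC webserver in Airflow 1.10.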

Unable to execute spark job using SparkSubmitOperator

Deadly submitted on 2019-12-04 12:14:10
Question: I am able to run a Spark job using the BashOperator, but I want to use the SparkSubmitOperator for it with Spark standalone mode. Here's my DAG for the SparkSubmitOperator and the stack trace: args = { 'owner': 'airflow', 'start_date': datetime(2018, 5, 24) } dag = DAG('spark_job', default_args=args, schedule_interval="*/10 * * * *") operator = SparkSubmitOperator( task_id='spark_submit_job', application='/home/ubuntu/test.py', total_executor_cores='1', executor_cores='1', executor_memory='2g', num_executors=
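The DAG in the excerpt is cut off; the following is a self-contained sketch of a SparkSubmitOperator DAG, assuming a spark_default connection pointing at the standalone master has been created, and with the application path, resources, and name as placeholders.

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

args = {'owner': 'airflow', 'start_date': datetime(2018, 5, 24)}
dag = DAG('spark_job', default_args=args, schedule_interval='*/10 * * * *')

# conn_id must name an Airflow connection whose host is the standalone
# master URL, e.g. host spark://master-host, port 7077 (placeholder values).
submit = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    application='/home/ubuntu/test.py',
    total_executor_cores=1,
    executor_cores=1,
    executor_memory='2g',
    num_executors=1,
    name='airflow-spark-test',
    verbose=True,
    dag=dag,
)

A frequent cause of failures with this operator is that the spark-submit binary is not on the PATH of the machine actually running the Airflow worker, so it is worth checking that first.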

How to allow airflow dags for concrete user(s) only

半世苍凉 submitted on 2019-12-04 11:42:59
Question: The problem is pretty simple. I need to limit Airflow web users to seeing and executing only certain DAGs and tasks. If possible, I'd prefer not to use Kerberos or OAuth. The multi-tenancy option seems like the way to go, but I couldn't make it work the way I expect. My current setup: added Airflow web users test and ikar via Web Authentication / Password; my unix username is ikar with a home in /home/ikar; there is no test unix user; airflow 1.8.2 is installed in /home/ikar/airflow; added two DAGs with one
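A sketch of the multi-tenancy setup this question is aiming at, assuming password authentication is already enabled: set filter_by_owner = True under [webserver] in airflow.cfg, and give each DAG an owner that matches the web login it should be visible to. The dag_id and owner below are placeholders.

# airflow.cfg (excerpt):
#   [webserver]
#   authenticate = True
#   filter_by_owner = True
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    'ikar_private_dag',                         # placeholder dag_id
    default_args={'owner': 'ikar',              # must match the web username
                  'start_date': datetime(2017, 1, 1)},
    schedule_interval='@daily',
)
BashOperator(task_id='hello', bash_command='echo hello', dag=dag)

With filter_by_owner enabled, non-superuser web users only see DAGs whose owner matches their login; it filters visibility in the UI rather than providing full task-level permissions.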

How to Connect Airflow to oracle database

余生颓废 submitted on 2019-12-04 11:34:07
Question: I am trying to create a connection to an Oracle DB instance (oracle:thin) using Airflow. According to the documentation I entered my hostname followed by the port number and SID: Host: example.com:1524/sid, and filled the other fields as: Conn Type: Oracle, Schema: username (the documentation says to use your username for schema), Login: username, Password: * * *. After the connection is set up, it gives the same error code for every query that I tried to execute (ORA-12514). It seems like Oracle doesn't let
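ORA-12514 means the listener does not know the service it was asked for, which usually points at a SID vs. service_name mix-up rather than an Airflow problem. One hedged way to isolate it is to test the DSN directly with cx_Oracle outside of Airflow; the host, port, service name, SID, and credentials below are placeholders.

import cx_Oracle

# Try both forms; whichever connects is the one to encode in the Airflow
# connection (ORA-12514 usually means the listener recognizes neither).
dsn_service = cx_Oracle.makedsn('example.com', 1524, service_name='my_service')  # placeholder
dsn_sid = cx_Oracle.makedsn('example.com', 1524, sid='my_sid')                   # placeholder

for dsn in (dsn_service, dsn_sid):
    try:
        conn = cx_Oracle.connect(user='username', password='secret', dsn=dsn)    # placeholders
        print('connected via', dsn)
        conn.close()
    except cx_Oracle.DatabaseError as exc:
        print('failed via', dsn, exc)

Once the working form is known, it can be reflected in the Airflow Oracle connection (for example a service name rather than a SID), depending on how the OracleHook in the installed Airflow version builds its DSN.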

Airflow BashOperator log doesn't contain full output

萝らか妹 submitted on 2019-12-04 11:09:17
I have an issue where the BashOperator is not logging all of the output from wget. It'll log only the first 1-5 lines of the output. I have tried this with only wget as the bash command: tester = BashOperator( task_id = 'testing', bash_command = "wget -N -r -nd --directory-prefix='/tmp/' http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip", dag = dag) I've also tried this as part of a longer bash script that has other commands that follow wget. Airflow does wait for the script to complete before firing downstream tasks. Here's an example bash script:

Airflow - creating dynamic Tasks from XCOM

别说谁变了你拦得住时间么 submitted on 2019-12-04 10:29:27
I'm attempting to generate a set of dynamic tasks from an XCom variable. In the XCom I'm storing a list, and I want to use each element of the list to dynamically create a downstream task. My use case is that I have an upstream operator that checks an SFTP server for files and returns a list of file names matching specific criteria. I want to create dynamic downstream tasks for each of the file names returned. I've simplified it to the below, and while it works I feel like it's not an idiomatic Airflow solution. In my use case, I would write a python function that's called from a python operator
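XCom values only exist at run time, while the DAG's task list is built at parse time, so in Airflow 1.x tasks cannot be fanned out directly from an XCom. A minimal sketch of one common workaround, assuming the file list can be re-derived at parse time (here from an Airflow Variable; the variable name, dag_id, and task logic are placeholders); otherwise the per-file work stays inside a single downstream task that loops over the pulled XCom.

import json
from datetime import datetime
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

dag = DAG('sftp_fanout', start_date=datetime(2017, 1, 1), schedule_interval='@daily')

def process_file(filename, **context):
    # placeholder per-file work
    print('processing', filename)

# Placeholder: an upstream job writes the matched file names into this Variable.
file_list = json.loads(Variable.get('sftp_file_list', default_var='[]'))

for name in file_list:
    PythonOperator(
        task_id='process_{}'.format(name.replace('.', '_')),
        python_callable=process_file,
        op_kwargs={'filename': name},
        provide_context=True,
        dag=dag,
    )

The trade-off is that the DAG shape now depends on state read at parse time, so the task list only changes when the Variable does; that is why many setups keep the fan-out inside one task instead.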

AssertionError: INTERNAL: No default project is specified

佐手、 submitted on 2019-12-04 10:17:08
New to Airflow. Trying to run SQL and store the result in a BigQuery table. Getting the following error. Not sure where to set up the default project id. Please help me. Error:
Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 28, in <module>
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 585, in test
    ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 53, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages
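A hedged sketch of one way to supply the missing project: put the project id into the Extra field of the Google Cloud connection that the BigQuery operator uses. The connection id, project id, and key path below are placeholders, and the extra key names are the ones used by the Google Cloud hook in Airflow's 1.x contrib code; the same fields can also be set on the connection form under Admin -> Connections in the UI.

import json
from airflow import settings
from airflow.models import Connection

session = settings.Session()
conn = session.query(Connection).filter(Connection.conn_id == 'bigquery_default').one()  # placeholder conn_id
conn.extra = json.dumps({
    'extra__google_cloud_platform__project': 'my-gcp-project',        # placeholder project id
    'extra__google_cloud_platform__key_path': '/path/to/key.json',    # placeholder service-account key
})
session.commit()
session.close()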

airflow trigger_dag execution_date is the next day, why?

两盒软妹~` submitted on 2019-12-04 10:11:41
Question: Recently I have been testing Airflow a lot and have run into a problem with execution_date when running airflow trigger_dag <my-dag>. I have learned that execution_date is not what we think at first, from here: Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available. start_date = datetime.combine
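A small sketch that makes the behavior visible: execution_date labels the start of the schedule interval being processed, so the scheduled run covering 2016-02-19 only starts after the interval closes at 2016-02-20 00:00 UTC, and a manual airflow trigger_dag stamps the run with roughly the current time on the machine unless an explicit date is passed with -e. The dag_id and dates below are placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('show_execution_date',
          start_date=datetime(2016, 2, 19),
          schedule_interval='@daily')

def show(**context):
    # For the scheduled run covering 2016-02-19, this prints 2016-02-19T00:00:00,
    # even though the task actually executes on or after 2016-02-20 00:00 UTC.
    print('execution_date:', context['execution_date'])
    print('actual wall clock (UTC):', datetime.utcnow())

PythonOperator(task_id='show', python_callable=show,
               provide_context=True, dag=dag)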

How to properly handle Daylight Savings Time in Apache Airflow?

南笙酒味 submitted on 2019-12-04 09:34:37
In Airflow, everything is supposed to be UTC (which is not affected by DST). However, we have workflows that deliver things based on time zones that are affected by DST. An example scenario: We have a job scheduled with a start date at 8:00 AM Eastern and a schedule interval of 24 hours. Every day at 8 AM Eastern the scheduler sees that it has been 24 hours since the last run, and runs the job. DST happens and we lose an hour. Today at 8 AM Eastern the scheduler sees that it has only been 23 hours because the time on the machine is UTC, and doesn't run the job until 9 AM Eastern, which is a late
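Pre-1.10 Airflow schedules purely in UTC, so one hedged workaround is to run the DAG hourly in UTC and make the DST-aware decision inside the task: convert the execution time to the delivery time zone and only proceed when it is the intended local hour. A sketch, with the dag_id, time zone, and target hour as placeholders.

from datetime import datetime
import pytz
from airflow import DAG
from airflow.operators.python_operator import ShortCircuitOperator

EASTERN = pytz.timezone('US/Eastern')   # placeholder delivery time zone
TARGET_HOUR = 8                         # placeholder local delivery hour

dag = DAG('dst_aware_delivery',
          start_date=datetime(2018, 1, 1),
          schedule_interval='0 * * * *')   # run hourly in UTC

def is_local_delivery_hour(**context):
    exec_dt = context['execution_date']
    if exec_dt.tzinfo is None:              # pre-1.10 passes naive UTC datetimes
        exec_dt = pytz.utc.localize(exec_dt)
    local_dt = exec_dt.astimezone(EASTERN)
    return local_dt.hour == TARGET_HOUR     # True only for the 8 AM Eastern slot

gate = ShortCircuitOperator(task_id='is_8am_eastern',
                            python_callable=is_local_delivery_hour,
                            provide_context=True,
                            dag=dag)
# downstream delivery tasks are chained after `gate`

This trades 23 skipped hourly runs per day for delivery that tracks local clock changes; native timezone-aware scheduling only arrived in Airflow 1.10.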

Starting the scheduler with airflow_failover

為{幸葍}努か submitted on 2019-12-04 08:47:09
Reference: https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
1. stop failover
2. stop scheduler
3. clear failover metadata
4. start failover
# start the scheduler on the master
. /data/venv/bin/activate
supervisorctl status
supervisorctl stop airflow_failover
supervisorctl stop airflow_scheduler
scheduler_failover_controller clear_metadata
supervisorctl start airflow_failover
# failover commands
scheduler_failover_controller metadata              # Get the Metadata from Metastore
scheduler_failover_controller clear_metadata        # Clear the Metadata in Metastore
scheduler_failover_controller is_scheduler_running  # Checks if the