airflow

Copy files from one Google Cloud Storage bucket to another using Apache Airflow

本秂侑毒 submitted on 2019-12-06 15:27:33
Problem: I want to copy files from a folder in a Google Cloud Storage bucket (e.g. Folder1 in Bucket1) to another bucket (e.g. Bucket2), but I can't find any Airflow operator for Google Cloud Storage that copies files. I know this is an old question, but I found myself dealing with this task too. Since I'm using Google Cloud Composer, GoogleCloudStorageToGoogleCloudStorageOperator was not available in the current version. I managed to solve the issue with a simple BashOperator: from airflow.operators.bash_operator import BashOperator with models.DAG( dag_name, schedule_interval=timedelta(days=1),
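For reference, a minimal sketch of the BashOperator approach described above, assuming gsutil is available on the worker (it is on Cloud Composer); the DAG id, schedule and start date are placeholders, and the bucket/folder names mirror the question's Bucket1/Folder1/Bucket2:

```python
# Hedged sketch: copy a GCS folder to another bucket with gsutil via BashOperator.
from datetime import datetime, timedelta

from airflow import models
from airflow.operators.bash_operator import BashOperator

with models.DAG(
    "gcs_to_gcs_copy",                      # hypothetical DAG id
    schedule_interval=timedelta(days=1),
    start_date=datetime(2019, 1, 1),
    catchup=False,
) as dag:
    copy_folder = BashOperator(
        task_id="copy_folder1_to_bucket2",
        # -m copies in parallel, -r recurses into the folder.
        bash_command="gsutil -m cp -r gs://bucket1/folder1 gs://bucket2/",
    )
```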

How do I trigger an Airflow DAG via the REST API?

北慕城南 submitted on 2019-12-06 13:24:11
The 1.10.0 documentation says I should be able to POST against /api/experimental/dags/&lt;DAG_ID&gt;/dag_runs to trigger a DAG run, but when I do this I instead receive an error: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> <title>400 Bad Request</title> <h1>Bad Request</h1> <p>The browser (or proxy) sent a request that this server could not understand.</p> To make this work, I figured out that I needed to send an empty JSON string in the body: curl -X POST \ http://airflow.dyn.fa.disney.com/api/experimental/dags/people_data/dag_runs \ -H 'Cache-Control: no-cache' \ -d '{}' Source: https:/
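The same call can be made from Python; here is a hedged sketch against the 1.10 experimental API, reusing the host and DAG id (people_data) from the curl command above and sending an optional conf object instead of a bare empty body:

```python
# Hedged sketch: trigger a DAG run via the experimental REST API with requests.
import json

import requests

response = requests.post(
    "http://airflow.dyn.fa.disney.com/api/experimental/dags/people_data/dag_runs",
    headers={"Cache-Control": "no-cache", "Content-Type": "application/json"},
    # The body must be valid JSON even if empty; "conf" can carry run parameters.
    data=json.dumps({"conf": {}}),
)
response.raise_for_status()
print(response.json())
```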

How to parse a JSON string in an Airflow template

左心房为你撑大大i submitted on 2019-12-06 13:19:00
Is it possible to parse a JSON string inside an Airflow template? I have an HttpSensor which monitors a job via a REST API, but the job id is in the response of the upstream task, which has xcom_push set to True. I would like to do something like the following; however, this code gives the error jinja2.exceptions.UndefinedError: 'json' is undefined t1 = SimpleHttpOperator( http_conn_id="s1", task_id="job", endpoint="some_url", method='POST', data=json.dumps({ "foo": "bar" }), xcom_push=True, dag=dag, ) t2 = HttpSensor( http_conn_id="s1", task_id="finish_job", endpoint="job/{{ json.loads(ti.xcom
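One way around the "'json' is undefined" error (a hedged sketch, not necessarily the poster's eventual fix) is to expose the json module to Jinja through the DAG's user_defined_macros; the connection id, task ids and endpoint follow the question, while the DAG id and the "id" response key are hypothetical:

```python
# Hedged sketch: make json.loads usable in templated fields via user_defined_macros.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.sensors.http_sensor import HttpSensor

dag = DAG(
    dag_id="parse_json_in_template",         # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    user_defined_macros={"json": json},      # makes `json` resolvable in templates
)

t1 = SimpleHttpOperator(
    http_conn_id="s1",
    task_id="job",
    endpoint="some_url",
    method="POST",
    data=json.dumps({"foo": "bar"}),
    xcom_push=True,                          # as in the question
    dag=dag,
)

t2 = HttpSensor(
    http_conn_id="s1",
    task_id="finish_job",
    # "id" is a hypothetical key in the upstream response body.
    endpoint="job/{{ json.loads(ti.xcom_pull(task_ids='job'))['id'] }}",
    dag=dag,
)

t1 >> t2
```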

Suggestion for scheduling tool(s) for building Hadoop-based data pipelines

允我心安 submitted on 2019-12-06 12:44:05
Between Apache Oozie, Spotify/Luigi and airbnb/airflow, what are the pros and cons of each? I have used Oozie and Airflow in the past to build data ingestion pipelines using Pig and Hive. Currently, I am building a pipeline that looks at logs, extracts useful events and puts them on Redshift. I found that Airflow was much easier to use, test and set up. It has a much nicer UI and lets users perform actions from the UI itself, which is not the case with Oozie. Any information about Luigi, or other insights regarding stability and issues, is welcome. Azkaban:

Running more than 32 concurrent tasks in Apache Airflow

喜你入骨 submitted on 2019-12-06 12:33:51
I'm running Apache Airflow 1.8.1 and would like to run more than 32 concurrent tasks on my instance, but I cannot get any of the configurations to work. I am using the CeleryExecutor, the Airflow config in the UI shows 64 for both parallelism and dag_concurrency, and I've restarted the Airflow scheduler, web server and workers numerous times (I'm actually testing this locally in a Vagrant machine, but have also tested it on an EC2 instance). airflow.cfg # The amount of parallelism as a setting to the executor. This defines # the max number of task instances that should run simultaneously # on this
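As a hedged illustration (values are illustrative, not the poster's file): with the CeleryExecutor, the per-worker setting is often the missing piece; in 1.8.x it is celeryd_concurrency under [celery], which defaults to 16, so two workers cap out at 32 running tasks regardless of parallelism and dag_concurrency.

```ini
# airflow.cfg sketch, illustrative values only
[core]
parallelism = 64
dag_concurrency = 64
non_pooled_task_slot_count = 128

[celery]
# per-worker task slots; restart the workers after changing this
celeryd_concurrency = 64
```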

Apache Airflow unable to establish a connection to a remote host via FTP/SFTP

谁都会走 submitted on 2019-12-06 11:01:52
Question: I am new to Apache Airflow and, so far, I have been able to work my way through the problems I have encountered, but I have now hit a wall. I need to transfer files to a remote server via SFTP and have not had any luck doing this. So far, I have gotten S3 and Postgres/Redshift connections to work in various DAGs via their respective hooks. I have been able to use the FTPHook successfully when testing against my local FTP server, but have not been able to figure out how to use SFTP to connect to a remote host. I
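A hedged sketch of one way to push a file over SFTP, assuming the contrib SFTPOperator that ships with Airflow 1.10 and an SSH connection already defined in Admin -> Connections; the connection id, file paths and DAG id below are placeholders:

```python
# Hedged sketch: upload a local file to a remote host over SFTP.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.sftp_operator import SFTPOperator, SFTPOperation

dag = DAG(
    dag_id="sftp_upload_example",           # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

upload = SFTPOperator(
    task_id="upload_file",
    ssh_conn_id="my_sftp_server",           # hypothetical connection id
    local_filepath="/tmp/sample.csv",       # hypothetical local path
    remote_filepath="/upload/sample.csv",   # hypothetical remote path
    operation=SFTPOperation.PUT,            # PUT = local -> remote
    dag=dag,
)
```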

Airflow Audit Logs

老子叫甜甜 submitted on 2019-12-06 09:39:44
I'm wondering what Airflow offers in the way of audit logs. My Airflow environment is running Airflow version 1.10 and uses the [ldap] section of the airflow.cfg file to authenticate against my company's Active Directory (AD). I see that when someone logs into Airflow through the web UI, the user's name is written to the webserver's log (shown below). I'm wondering, though, whether Airflow can be modified to also log when a user turns a DAG on or off, creates a new Airflow Variable or Pool, clears a task, marks a task as success, or performs any other operation a user can do. I need to be able to have
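For what it's worth, Airflow does record many web UI and CLI actions in the log table of its metadata database (exposed as airflow.models.Log); which events are captured varies by version, so treat the following query as a hedged sketch of one way to approximate an audit trail rather than a complete solution:

```python
# Hedged sketch: read recent user/CLI events from Airflow's metadata `log` table.
from airflow.models import Log
from airflow.settings import Session

session = Session()
recent_events = (
    session.query(Log.dttm, Log.owner, Log.event, Log.dag_id, Log.task_id)
    .order_by(Log.dttm.desc())
    .limit(20)
    .all()
)
for dttm, owner, event, dag_id, task_id in recent_events:
    print(dttm, owner, event, dag_id, task_id)
session.close()
```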

Test DAG run for Airflow 1.9 in unittest

大城市里の小女人 submitted on 2019-12-06 09:05:48
I had implemented a test case for running an individual DAG, but it does not seem to work in 1.9; this may be due to the stricter pool handling introduced in Airflow 1.8. I am trying to run the test case below: from airflow import DAG from airflow.operators.dummy_operator import DummyOperator from airflow.operators.python_operator import PythonOperator class DAGTest(unittest.TestCase): def make_tasks(self): dag = DAG('test_dag', description='a test', schedule_interval='@once', start_date=datetime(2018, 6, 26), catchup=False) du1 = DummyOperator(task_id='dummy1', dag=dag) du2 = DummyOperator(task_id=
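A hedged sketch of one way to exercise a single task from a unit test on Airflow 1.x (it assumes an initialized metadata database and is not the poster's full test): run the operator directly with ignore_ti_state=True so earlier runs and pool state do not block it.

```python
# Hedged sketch: run one operator of a throwaway DAG inside a unittest.
import unittest
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


class DAGTest(unittest.TestCase):
    def test_dummy_task_runs(self):
        dag = DAG(
            'test_dag',
            description='a test',
            schedule_interval='@once',
            start_date=datetime(2018, 6, 26),
            catchup=False,
        )
        task = DummyOperator(task_id='dummy1', dag=dag)
        # Execute just this task for a single schedule interval.
        task.run(
            start_date=dag.start_date,
            end_date=dag.start_date,
            ignore_ti_state=True,
        )


if __name__ == '__main__':
    unittest.main()
```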

Airflow: how to use xcom_push and xcom_pull in a non-PythonOperator

佐手、 submitted on 2019-12-06 08:58:21
I see a lot of examples of how to use xcom_push and xcom_pull with PythonOperators in Airflow. I need to do an xcom_pull from a non-PythonOperator class and couldn't find out how to do it. Any pointer or example would be appreciated! You can access XCom variables from within templated fields. For example, to read from XCom: myOperator = MyOperator( message="Operation result: {{ task_instance.xcom_pull(task_ids=['task1', 'task2'], key='result_status') }}", ... It is also possible not to specify a task, in order to get all XCom pushes within one DagRun with the same key name: myOperator = MyOperator( message=
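MyOperator above is the poster's own class; as a hedged, self-contained illustration, a stock BashOperator works the same way because its bash_command field is templated (the DAG id and the pushed value below are placeholders):

```python
# Hedged sketch: pull an XCom value inside a templated field of a non-Python operator.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="xcom_in_templates",             # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

def push_status(**context):
    # Push under an explicit key, matching the key used in the template below.
    context["ti"].xcom_push(key="result_status", value="OK")

task1 = PythonOperator(
    task_id="task1",
    python_callable=push_status,
    provide_context=True,
    dag=dag,
)

# The template is rendered before execution, so no Python code is needed
# in the downstream operator itself.
report = BashOperator(
    task_id="report",
    bash_command=(
        "echo 'Operation result: "
        "{{ task_instance.xcom_pull(task_ids='task1', key='result_status') }}'"
    ),
    dag=dag,
)

task1 >> report
```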