airflow

How to run one airflow task and all its dependencies?

ぐ巨炮叔叔 submitted on 2019-12-05 01:39:14
I suspected that airflow run dag_id task_id execution_date would run all upstream tasks, but it does not: it simply fails when it sees that not all dependent tasks have been run. How can I run a specific task and all of its dependencies? I am guessing this is not possible because of an Airflow design decision, but is there a way to get around this? You can run a task independently by using the -i/-I/-A flags along with the run command, but the design of Airflow indeed does not permit running a specific task together with all of its dependencies. You can backfill the DAG by removing non-related tasks from the DAG for …
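A rough sketch of that backfill-style workaround, using Airflow's Python API rather than the CLI (a hedged example: the DAG id, task id, and date are placeholders, and the sub_dag/clear/run calls assume Airflow 1.x): build a slice of the DAG containing the task plus everything upstream of it, then run that slice.

    # Hedged sketch, assuming Airflow 1.x; "my_dag" and "my_task" are placeholders.
    from datetime import datetime

    from airflow.models import DagBag

    execution_date = datetime(2019, 12, 1)

    dag = DagBag().get_dag("my_dag")
    # Keep only my_task and every task upstream of it.
    partial = dag.sub_dag(
        task_regex="^my_task$",
        include_upstream=True,
        include_downstream=False,
    )
    partial.clear(start_date=execution_date, end_date=execution_date)
    partial.run(start_date=execution_date, end_date=execution_date)

From the command line, airflow backfill with its task-regex option should achieve much the same effect.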

Install airflow package extras in PyCharm

佐手、 submitted on 2019-12-05 01:30:26
Question: I want to use the Airflow package extras s3 and postgres in PyCharm but do not know how to install them (on macOS Sierra). My attempts so far: Airflow itself can be installed from Preferences > Project > Project Interpreter > +, but not the extras, as far as I can work out. The extras can be installed with pip in the terminal using $ pip install airflow[s3,postgres], but they end up in a different interpreter (~/anaconda) than the one used by PyCharm (/usr/local/Cellar/python/2.7.12_2/Frameworks …
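One simple way to make sure the extras land in the interpreter PyCharm actually uses (a minimal sketch, assuming pip is available to that interpreter) is to run the install through sys.executable from PyCharm's own Python console:

    # Run this from PyCharm's Python console; sys.executable then points at
    # the project interpreter, so the extras are installed where PyCharm looks.
    import subprocess
    import sys

    print(sys.executable)  # confirm which interpreter is in use
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "airflow[s3,postgres]"]
    )

Note that newer releases publish the package as apache-airflow, so the requirement may need to be spelled apache-airflow[s3,postgres].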

airflow TriggerDagRunOperator how to change the execution date

独自空忆成欢 submitted on 2019-12-05 00:37:04
Question: I noticed that for scheduled tasks the execution date is set in the past, as described here: "Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available." However, when a DAG triggers another DAG, the execution time is set to now(). Is there a way to have the triggered DAGs run with the same execution time of …
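As an illustration of one workaround (a hedged sketch, assuming an Airflow version whose TriggerDagRunOperator accepts a templated execution_date argument; both DAG ids are placeholders), the triggering DAG can forward its own execution date to the child:

    # Hedged sketch: pass the parent's execution_date through to the triggered DAG.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dagrun_operator import TriggerDagRunOperator

    parent_dag = DAG(
        dag_id="parent_dag",                      # placeholder
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    trigger_child = TriggerDagRunOperator(
        task_id="trigger_child_dag",
        trigger_dag_id="child_dag",               # placeholder target DAG id
        execution_date="{{ execution_date }}",    # templated; reuse the parent's date
        dag=parent_dag,
    )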

Airflow Relative Importing Outside /dag Directory

自闭症网瘾萝莉.ら submitted on 2019-12-04 21:23:49
Question: I haven't been able to move common code outside of the dag directory that Airflow uses. I've looked in the Airflow source and found imp.load_source. Is it possible to use imp.load_source to load modules that exist outside of the dag directory? In the example below, this would mean importing either foo or bar from the common directory.

    airflow_home
    ├── dags
    │   ├── dag_1.py
    │   └── dag_2.py
    └── common
        ├── foo.py
        └── bar.py

Answer 1: Just add __init__.py files in all three folders; it should work. In fact …
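Regarding the imp.load_source part of the question, a minimal sketch (assuming the layout above; some_function is a hypothetical helper defined in common/foo.py) inside a DAG file could look like this:

    # Hedged sketch: load common/foo.py and common/bar.py by path from a DAG file.
    import imp
    import os

    COMMON_DIR = os.path.join(os.path.dirname(__file__), os.pardir, "common")

    foo = imp.load_source("foo", os.path.join(COMMON_DIR, "foo.py"))
    bar = imp.load_source("bar", os.path.join(COMMON_DIR, "bar.py"))

    # The loaded modules then behave like ordinary imports, e.g.:
    # result = foo.some_function()  # hypothetical helper in common/foo.py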

Get Pycharm to see dynamically generated python modules

橙三吉。 submitted on 2019-12-04 19:58:22
Question: Apache Airflow installs plugins by modifying sys.modules like this: sys.modules[operators_module.__name__] = operators_module. This is how it makes Python classes from its plugins folder importable via from airflow.operators.plugin_name import Operator, even though the class Operator actually lives in airflow/plugins/Operators.py. This makes it impossible for PyCharm to understand the above import statement, because it is a non-traditional way of generating a module and module name. Is …
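For reference, the mechanism the question describes can be reproduced in a few lines (a self-contained sketch; the module and class names simply mirror the question, and with a real Airflow install the plugin manager performs this registration itself):

    # Registering a dynamically built module under a dotted name.
    import sys
    import types

    class Operator(object):
        """Stand-in for the class defined in airflow/plugins/Operators.py."""

    operators_module = types.ModuleType("airflow.operators.plugin_name")
    operators_module.Operator = Operator
    sys.modules[operators_module.__name__] = operators_module

    # The import now resolves at runtime even though no physical
    # airflow/operators/plugin_name.py file exists, which is exactly
    # the part that PyCharm's static analysis cannot see.
    from airflow.operators.plugin_name import Operator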

Scheduling spark jobs on a timely basis

筅森魡賤 submitted on 2019-12-04 18:22:43
Which is the recommended tool for scheduling Spark jobs on a daily/weekly basis: 1) Oozie, 2) Luigi, 3) Azkaban, 4) Chronos, or 5) Airflow? Thanks in advance. Joe Harris: Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines. Airflow: try this first. Decent UI, Python-ish job definition, semi-accessible for non-programmers; the dependency declaration syntax is weird. Airflow has built-in support for the fact that scheduled jobs often need to be rerun and/or backfilled, so make sure you build your pipelines to support this. Azkaban: nice UI, …
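To make the Airflow option concrete, here is a minimal daily Spark DAG sketch (the paths, DAG id, and spark-submit command line are illustrative; it shells out via BashOperator rather than assuming any particular Spark operator is installed):

    # Hedged sketch of a daily spark-submit DAG with backfill-friendly settings.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "airflow",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG(
        dag_id="daily_spark_job",
        default_args=default_args,
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        catchup=True,  # lets Airflow backfill missed runs, as noted above
    ) as dag:
        run_spark = BashOperator(
            task_id="spark_submit",
            bash_command="spark-submit --master yarn /opt/jobs/daily_job.py {{ ds }}",
        )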

Airflow + celery or dask. For what, when?

為{幸葍}努か submitted on 2019-12-04 17:44:17
Question: I read the following in the official Airflow documentation: What does this mean exactly? What do the authors mean by scaling out? That is, when is it not enough to use Airflow alone, and when would anyone use Airflow in combination with something like Celery? (The same question applies to Dask.) Answer 1: In Airflow terminology, an "Executor" is the component responsible for running your task. The LocalExecutor does this by spawning threads on the computer Airflow runs on and letting a thread execute the task. Naturally, your …
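For orientation, a small inspection sketch (Airflow 1.x): the executor is chosen in airflow.cfg under [core], or via the AIRFLOW__CORE__EXECUTOR environment variable, not in DAG code; the snippet below only prints the current choice.

    # Print which executor this Airflow installation is configured to use.
    from airflow.configuration import conf

    print(conf.get("core", "executor"))
    # SequentialExecutor / LocalExecutor run tasks on this one machine;
    # CeleryExecutor (or the Dask executor) hands tasks to remote workers,
    # which is what the documentation means by "scaling out".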

Apache Airflow unable to establish connection to remote host via FTP/SFTP

混江龙づ霸主 submitted on 2019-12-04 17:33:43
I am new to Apache Airflow and, so far, I have been able to work my way through the problems I have encountered, but I have now hit a wall. I need to transfer files to a remote server via SFTP, and I have not had any luck doing this. So far, I have gotten S3 and Postgres/Redshift connections to work via their respective hooks in various DAGs. I have been able to use the FTPHook successfully against my local FTP server, but I have not been able to figure out how to use SFTP to connect to a remote host. I can connect to the remote host via SFTP with FileZilla, so I know my credentials are correct.
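One possible direction (a hedged sketch, assuming Airflow 1.9+ with the contrib SFTPHook and its pysftp dependency installed; "my_sftp_conn" is a placeholder for an SFTP/SSH Connection configured in the Airflow UI, and the file paths are illustrative):

    # Upload a local file to a remote host over SFTP from a python_callable.
    from airflow.contrib.hooks.sftp_hook import SFTPHook

    def upload_report(**context):
        hook = SFTPHook(ftp_conn_id="my_sftp_conn")  # placeholder connection id
        hook.store_file(
            remote_full_path="/remote/inbound/report.csv",
            local_full_path="/tmp/report.csv",
        )

The same connection can also be exercised interactively with hook.get_conn() to verify that the credentials and host key are accepted outside FileZilla.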

Airflow: PythonOperator: why include the 'ds' arg?

╄→尐↘猪︶ㄣ submitted on 2019-12-04 16:50:50
Question: When defining a function to be used later as a python_callable, why is 'ds' included as the first argument of the function? For example: def python_func(ds, **kwargs): pass. I looked into the Airflow documentation but could not find any explanation. Answer 1: This is related to the provide_context=True parameter. As per the Airflow documentation, if it is set to true, Airflow will pass in a set of keyword arguments that can be used in your function. This set of kwargs corresponds exactly to what you can use in your …
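A short sketch of how this fits together in Airflow 1.x (the DAG id is a placeholder): with provide_context=True, Airflow passes the template context as keyword arguments, and ds, the execution date formatted as YYYY-MM-DD, is simply one of them, so it can be captured as a named parameter while the rest land in **kwargs.

    # provide_context=True supplies ds, execution_date, ti, etc. to the callable.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def python_func(ds, **kwargs):
        print("ds:", ds)                                    # e.g. "2019-12-04"
        print("execution_date:", kwargs["execution_date"])

    dag = DAG(
        dag_id="context_example",                           # placeholder DAG id
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    print_ds = PythonOperator(
        task_id="print_ds",
        python_callable=python_func,
        provide_context=True,  # without this, ds/kwargs are not supplied
        dag=dag,
    )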

Running Job On Airflow Based On Webrequest

不想你离开。 submitted on 2019-12-04 16:45:27
Question: I wanted to know whether Airflow tasks can be executed upon receiving a request over HTTP. I am not interested in the scheduling part of Airflow; I just want to use it as a substitute for Celery. An example operation would be something like this: a user submits a form requesting some report; the backend receives the request and sends the user a notification that it has been received; the backend then schedules a job using Airflow to run immediately; Airflow then executes a series of tasks …
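One common approach (a hedged sketch, assuming Airflow 1.10's experimental REST API is enabled and the webserver runs at the URL shown; the DAG id and payload are placeholders) is for the web backend to trigger a DAG run over HTTP:

    # Trigger a DAG run from a web backend via the experimental REST API.
    import requests

    def trigger_report_dag(report_params):
        resp = requests.post(
            "http://localhost:8080/api/experimental/dags/generate_report/dag_runs",
            json={"conf": report_params},  # readable inside the DAG as dag_run.conf
        )
        resp.raise_for_status()
        return resp.json()

From a shell, airflow trigger_dag with its --conf option offers the same entry point.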