I'm still in the process of deploying Airflow and I've already felt the need to merge operators together. The most common use-case would
I have combined various hooks to create a single operator based on my needs. A simple example: I combined the GCS hook's delete, copy, list and get_size methods to create a single operator called GcsDataValidationOperator. A rule of thumb is to keep the operator idempotent, i.e. running it multiple times should produce the same result.
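As a rough sketch of the idea (not my actual operator): the class below chains list, get_size, copy and delete in one execute() call and stays idempotent. To keep it runnable without Airflow, FakeGcsHook is a hypothetical in-memory stand-in with simplified signatures, and the validation rule (copy objects of at least min_size bytes, delete the rest) is an assumption made for illustration. In a real deployment the class would subclass airflow.models.BaseOperator and use the GCS hook shipped with your Airflow version.

```python
class FakeGcsHook:
    """Hypothetical in-memory stand-in for a GCS hook (simplified signatures)."""

    def __init__(self, buckets):
        # buckets maps bucket name -> {object name: bytes}
        self.buckets = buckets

    def list(self, bucket, prefix=""):
        return sorted(n for n in self.buckets[bucket] if n.startswith(prefix))

    def get_size(self, bucket, obj):
        return len(self.buckets[bucket][obj])

    def copy(self, src_bucket, src_obj, dst_bucket, dst_obj):
        data = self.buckets[src_bucket][src_obj]
        self.buckets.setdefault(dst_bucket, {})[dst_obj] = data

    def delete(self, bucket, obj):
        self.buckets[bucket].pop(obj, None)


class GcsDataValidationOperator:
    """Sketch only: in Airflow this would subclass BaseOperator."""

    def __init__(self, task_id, hook, source_bucket, prefix, target_bucket, min_size=1):
        self.task_id = task_id
        self.hook = hook
        self.source_bucket = source_bucket
        self.prefix = prefix
        self.target_bucket = target_bucket
        self.min_size = min_size

    def execute(self, context):
        valid = []
        for name in self.hook.list(self.source_bucket, prefix=self.prefix):
            if self.hook.get_size(self.source_bucket, name) >= self.min_size:
                # Copy overwrites the target, so a re-run leaves the same state
                self.hook.copy(self.source_bucket, name, self.target_bucket, name)
                valid.append(name)
            else:
                # Deleting an already-deleted object is a no-op on re-run
                self.hook.delete(self.source_bucket, name)
        return valid


hook = FakeGcsHook({"raw": {"data/a.csv": b"id,val\n1,2\n", "data/empty.csv": b""}})
op = GcsDataValidationOperator("validate", hook, "raw", "data/", "validated")
first = op.execute(context={})
second = op.execute(context={})  # same result and same bucket state as the first run
```

Running execute() twice returns the same list and leaves both buckets unchanged after the first run, which is the idempotency property mentioned above.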
Should operators be composed at all or is it better to have discrete steps?
The only pitfall is maintainability: when the hooks change in the master branch, you will need to update all your operators manually if there are any breaking changes.
Any pitfalls, improvements in above approaches?
You can use PythonOperator and call the built-in hooks (or an existing operator's .execute method) inside the callable, but that would still put a lot of detail in the DAG file. Hence, I would still go for the new-operator approach.
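For comparison, a minimal sketch of the PythonOperator route. The function name validate_gcs_data, its parameters and the size check are hypothetical; the hook is passed in as an argument only so the snippet runs stand-alone with a mocked hook, whereas in a DAG you would normally construct the hook inside the callable.

```python
from unittest.mock import MagicMock


def validate_gcs_data(hook, bucket, prefix, min_size=1):
    """Raise if any object under `prefix` is smaller than `min_size` bytes."""
    names = hook.list(bucket, prefix=prefix)
    too_small = [n for n in names if hook.get_size(bucket, n) < min_size]
    if too_small:
        raise ValueError(f"Objects below {min_size} bytes: {too_small}")
    return len(names)


# In a DAG file (requires Airflow) the wiring would look roughly like this;
# gcs_hook stands for whichever GCS hook instance your Airflow version provides:
# PythonOperator(
#     task_id="validate_gcs_data",
#     python_callable=validate_gcs_data,
#     op_kwargs={"hook": gcs_hook, "bucket": "raw", "prefix": "data/"},
# )

# Stand-alone check with a mocked hook instead of a real GCS connection
sizes = {"data/a.csv": 11, "data/b.csv": 42}
hook = MagicMock()
hook.list.return_value = list(sizes)
hook.get_size.side_effect = lambda bucket, name: sizes[name]
checked = validate_gcs_data(hook, "raw", "data/")
```

Every task that needs this check has to carry the callable and its op_kwargs in the DAG file, which is the clutter a dedicated operator avoids.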
Any other ways to combine operators together?
Hooks are just interfaces to external platforms and databases like Hive, GCS, etc., and they form the building blocks for operators. This is what allows the creation of new operators. It also means you can customize the templated fields, add a Slack notification for each granular step inside your new operator, and have your own logging details.
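A small sketch of that kind of customization, under stated assumptions: the class name NotifyingValidationOperator, the _step helper and the notify callable (a stand-in for a Slack call) are all hypothetical, and the hook is mocked. template_fields is the class attribute Airflow's BaseOperator uses to decide which attributes get Jinja-rendered; in this plain-Python version it is only declared, not rendered.

```python
import logging
from unittest.mock import MagicMock

log = logging.getLogger("gcs_validation")


class NotifyingValidationOperator:
    """Sketch only: in Airflow this would subclass BaseOperator."""

    # Under Airflow, attributes named here are rendered with Jinja before
    # execute() runs, so a DAG could pass prefix="data/{{ ds }}/"
    template_fields = ("prefix",)

    def __init__(self, task_id, hook, bucket, prefix, notify):
        self.task_id = task_id
        self.hook = hook
        self.bucket = bucket
        self.prefix = prefix
        self.notify = notify  # hypothetical callable, e.g. a Slack sender

    def _step(self, message):
        # One log line and one notification per granular step
        log.info("%s: %s", self.task_id, message)
        self.notify(f"[{self.task_id}] {message}")

    def execute(self, context):
        names = self.hook.list(self.bucket, prefix=self.prefix)
        self._step(f"listed {len(names)} objects under {self.prefix}")
        total = sum(self.hook.get_size(self.bucket, n) for n in names)
        self._step(f"total size {total} bytes")
        return total


messages = []
hook = MagicMock()
hook.list.return_value = ["data/2024-01-01/a.csv", "data/2024-01-01/b.csv"]
hook.get_size.return_value = 10
op = NotifyingValidationOperator(
    "validate", hook, "raw", "data/2024-01-01/", notify=messages.append
)
total = op.execute(context={})
```

Here the notifications are simply collected in a list; swapping messages.append for a function that posts to Slack would send one message per step.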
In taxonomy of Airflow, is the primary motive of Hooks same as above, or do they serve some other purposes too?
FWIW: I am a PMC member and a contributor to the Airflow project.