Question
As part of my Spark pipeline, I have to perform the following tasks on EMR / S3:
- Delete: (Recursively) delete all files / directories under a given S3 bucket
- Copy: Copy the contents of a directory (subdirectories & files) to a given S3 bucket
Based on my current knowledge, Airflow doesn't provide operators / hooks for these tasks. I therefore plan to implement them as follows:
- Delete: Extend S3Hook to add a function that performs aws s3 rm on the specified S3 bucket (see the sketch below)
- Copy: Use SSHExecuteOperator to perform hadoop distcp
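For the delete part, a minimal sketch of what I have in mind is below. The S3DeleteHook class and delete_prefix method are names I made up for illustration; it assumes Airflow 1.9's S3Hook, whose get_conn() returns a boto3 S3 client:

```python
# Hypothetical plugin hook; assumes Airflow 1.9's boto3-backed S3Hook.
from airflow.hooks.S3_hook import S3Hook


class S3DeleteHook(S3Hook):
    """S3Hook extended with a recursive delete, roughly `aws s3 rm --recursive`."""

    def delete_prefix(self, bucket_name, prefix=''):
        """Delete every object under `prefix` in `bucket_name`."""
        client = self.get_conn()  # boto3 S3 client
        paginator = client.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            keys = [{'Key': obj['Key']} for obj in page.get('Contents', [])]
            if keys:
                # delete_objects accepts at most 1000 keys; each listing page stays under that
                client.delete_objects(Bucket=bucket_name, Delete={'Objects': keys})
```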
My questions are:
- I reckon that the tasks I intend to perform are quite primitive. Are these functionalities already provided by Airflow?
- If not, is there a better way to achieve this than what I plan to do?
I'm using:
- Airflow 1.9.0 [Python 3.6.6] (will upgrade to Airflow 1.10 once it is released)
- EMR 5.13.0
Answer 1:
Well, the delete is a primitive operation, yes, but the hadoop distcp is not. To answer your questions:
- No, Airflow does not have functions on the S3 hook to perform these actions.
- Creating your own plugin to extend the s3_hook and also using the SSH operator to perform the distcp is, in my opinion, a good way to do this.
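If it helps, here is roughly how the two pieces could be wired into a DAG. Connection IDs, the bucket, the paths and the import path of the custom hook are all placeholders, and I'm using the contrib SSHOperator for the distcp step (adapt this to whichever SSH operator your Airflow version ships):

```python
# Hypothetical DAG wiring; all IDs, buckets and paths below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.ssh_operator import SSHOperator

from my_plugins.s3_delete_hook import S3DeleteHook  # the custom hook sketched in the question


def cleanup_target():
    # Recursively remove the old output before the copy runs
    S3DeleteHook(aws_conn_id='aws_default').delete_prefix(
        bucket_name='my-target-bucket', prefix='output/')


with DAG('hdfs_to_s3', start_date=datetime(2018, 8, 1), schedule_interval=None) as dag:
    delete_old_output = PythonOperator(
        task_id='delete_old_output',
        python_callable=cleanup_target)

    # Runs distcp on the EMR master node over SSH
    copy_to_s3 = SSHOperator(
        task_id='copy_to_s3',
        ssh_conn_id='emr_master_ssh',
        command='hadoop distcp hdfs:///data/output s3://my-target-bucket/output/')

    delete_old_output >> copy_to_s3
```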
Not sure why the standard S3_Hook does not have a delete function. It MAY be because S3 provides an "eventually consistent" consistency model (probably not the reason, but good to keep in mind anyway).
Source: https://stackoverflow.com/questions/51703173/s3-delete-hdfs-to-s3-copy