S3 Delete & HDFS to S3 Copy


Question


As part of my Spark pipeline, I have to perform the following tasks on EMR / S3:

  1. Delete: (Recursively) Delete all files / directories under a given S3 bucket
  2. Copy: Copy contents of a directory (subdirectories & files) to a given S3 bucket

Based on my current knowledge, Airflow doesn't provide operators / hooks for these tasks. I therefore plan to implement them as follows:

  1. Delete: Extend S3Hook to add a function that performs aws s3 rm on the specified S3 bucket (a sketch follows this list)
  2. Copy: Use SSHExecuteOperator to perform hadoop distcp
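
For the delete, something like the following could work as the hook extension. This is a minimal sketch, assuming Airflow 1.9's boto3-backed S3Hook; the S3DeleteHook class and delete_prefix method names are made up for illustration:

    # Minimal sketch: S3Hook extended with a recursive delete,
    # roughly equivalent to `aws s3 rm --recursive`.
    from airflow.hooks.S3_hook import S3Hook

    class S3DeleteHook(S3Hook):

        def delete_prefix(self, bucket_name, prefix=''):
            # get_bucket() returns a boto3 Bucket resource; its objects
            # collection handles pagination and batches the deletes
            # (up to 1000 keys per underlying DeleteObjects request).
            bucket = self.get_bucket(bucket_name)
            bucket.objects.filter(Prefix=prefix).delete()

This could then be called from a PythonOperator, e.g. S3DeleteHook(aws_conn_id='aws_default').delete_prefix('my-bucket', 'some/prefix/'). Later Airflow releases add a delete_objects method to S3Hook, which may be worth checking after the planned 1.10 upgrade.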

My questions are:

  • I reckon that the tasks I intend to perform are quite primitive. Are these functionalities already provided by Airflow?
  • If not, is there a better way to achieve this than what I plan to do?

I'm using:

  • Airflow 1.9.0 [Python 3.6.6] (will upgrade to Airflow 1.10 once it is released)
  • EMR 5.13.0

Answer 1:


Well, the delete is indeed a primitive operation, but the hadoop distcp is not. To answer your questions:

  1. No, Airflow does not have functions on the S3 hook to perform these actions.
  2. Creating your own plugin to extend the s3_hook, and using the SSH operator to perform the distcp, is in my opinion a good way to do this (see the sketch below).
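
For the distcp part, here is a minimal sketch, assuming an Airflow SSH connection named emr_ssh that points at the EMR master node; the paths and bucket name are placeholders. One caveat: Airflow 1.9 replaced SSHExecuteOperator with the paramiko-based SSHOperator, so on 1.9.0 the task would look like:

    # Minimal sketch: run hadoop distcp on the EMR master over SSH.
    from airflow.contrib.operators.ssh_operator import SSHOperator

    hdfs_to_s3 = SSHOperator(
        task_id='hdfs_to_s3_distcp',
        ssh_conn_id='emr_ssh',  # hypothetical connection to the EMR master
        command='hadoop distcp /path/on/hdfs s3://my-bucket/target/',
        dag=dag,  # assumes a `dag` object defined elsewhere in the file
    )

On EMR specifically, the bundled s3-dist-cp tool may also be worth a look, as it is optimized for copies into S3.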

Not sure why the standard S3Hook does not have a delete function. It may be because S3 provides an "eventually consistent" consistency model (probably not the reason, but good to keep in mind anyway).



Source: https://stackoverflow.com/questions/51703173/s3-delete-hdfs-to-s3-copy
