RDS to S3 - Data Transformation AWS

Posted by 半城伤御伤魂 on 2021-02-11 12:31:32

Question


I have about 30 tables in my RDS instance (PostgreSQL or Oracle; I haven't decided which yet). I want to fetch all the records that were inserted or updated in the last 4 hours (configurable), create a CSV file for each table, and store the files in S3. I want the whole process to be transactional: if there is an error fetching data from one table, I don't want the data for the other 29 tables to be persisted in S3. The data isn't very large; it should be on the order of a few hundred records or fewer per table for each 4-hour window.

I am thinking of running a Spark job on an EMR cluster to fetch the data from RDS, create a CSV for each table, and upload all the files to S3 at the end of the process. The EMR cluster will be terminated once the data is in S3. A CloudWatch trigger will invoke a Lambda every 4 hours, which will spin up a new EMR cluster to perform this job.

Are there any alternative approaches worth exploring for this transformation?


Answer 1:


Take a look at AWS Glue. It uses EMR under the hood, but you don't need to care about infrastructure or configuration: just set up a crawler and write your ETL job.

Please note that AWS Glue doesn't support predicate pushdown for JDBC connections (currently S3 only), which means it will load the entire table first and only then apply the filter.
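One way to avoid loading whole tables is to put the time filter into the query that is sent to the database. The sketch below only builds such a query string and is not something the original answer proposes. It assumes each table has a last-modified timestamp column, here given the hypothetical name updated_at, and that table names are simple identifiers. The TIMESTAMP literal and the alias without AS are accepted by both PostgreSQL and Oracle.

```python
from datetime import datetime, timedelta


def incremental_subquery(table: str, now: datetime, window_hours: int = 4,
                         ts_column: str = "updated_at") -> str:
    """Build a parenthesised subquery that selects rows changed in the window.

    `ts_column` is a hypothetical last-modified column; the question does
    not name one. `window_hours` is the configurable look-back period.
    """
    # Identifiers cannot be bound as parameters, so accept only plain names
    # (letters, digits, underscores) to keep the generated SQL safe.
    for name in (table, ts_column):
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected identifier: {name!r}")

    cutoff = now - timedelta(hours=window_hours)
    literal = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return f"(SELECT * FROM {table} WHERE {ts_column} >= TIMESTAMP '{literal}') src"
```

With plain Spark, a string like this can be passed as the dbtable option of the JDBC reader, which accepts anything valid in a FROM clause, so the database applies the filter before any rows are transferred.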

You should also think carefully about atomicity, since a Glue ETL job simply processes data and writes to a sink without transactions. In case of failure it won't remove partially written records, so you have to manage that yourself. There are a few options I would consider:

  1. Write the data to a temp folder (local or S3) for each execution, and then move the objects to the final destination, either with the aws s3 sync command or by copying the data with TransferManager from the AWS SDK.
  2. Write the data to the final destination in a dedicated folder, and in case of failure delete that folder using the CLI or SDK.
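The two options above can be combined in a small publish step. This is a minimal sketch, not a definitive implementation: it assumes s3 is a boto3 S3 client (for example boto3.client("s3")), that the CSV text for every table has already been built in memory, and that the function name, the prefixes staging and exports, and the run_id folder are all hypothetical choices.

```python
def publish_all_or_nothing(s3, bucket, run_id, csv_by_table,
                           final_prefix="exports", staging_prefix="staging"):
    """Stage every CSV first, then copy the whole run to its final folder.

    `s3` is assumed to be a boto3 S3 client. `csv_by_table` maps a table
    name to its CSV text. Prefix names are illustrative only.
    """
    staged = []  # (table, staging key) pairs uploaded so far
    try:
        for table, body in csv_by_table.items():
            key = f"{staging_prefix}/{run_id}/{table}.csv"
            s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
            staged.append((table, key))
    except Exception:
        # Something failed while staging: remove what was written and
        # leave the final destination untouched.
        for _, key in staged:
            s3.delete_object(Bucket=bucket, Key=key)
        raise

    final_keys = []
    for table, key in staged:
        final_key = f"{final_prefix}/{run_id}/{table}.csv"
        # Server-side copy from the staging prefix to the run's final folder.
        s3.copy_object(Bucket=bucket, Key=final_key,
                       CopySource={"Bucket": bucket, "Key": key})
        final_keys.append(final_key)

    # Staging copies are no longer needed once every object is promoted.
    for _, key in staged:
        s3.delete_object(Bucket=bucket, Key=key)
    return final_keys
```

S3 has no multi-object transactions, so the copy phase is itself not atomic. Writing each run into its own folder keeps a failed run easy to identify and delete, and consumers should read a run's folder only after the job has reported success.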


Source: https://stackoverflow.com/questions/50361589/rds-to-s3-data-transformation-aws
