Copying only new records from AWS DynamoDB to AWS Redshift

Submitted by 家住魔仙堡 on 2020-01-02 02:37:10

Question


I see there are tons of examples and documentation for copying data from DynamoDB to Redshift, but we are looking for an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to truncate the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?


Answer 1:


DynamoDB has a feature (in preview at the time of writing) called Streams:

Amazon DynamoDB Streams maintains a time-ordered sequence of item-level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item-level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.

This feature lets you process new updates as they come in and do what you want with them, rather than designing an export system on top of DynamoDB.

You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
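As a rough sketch of that idea (not from the answer itself): a Lambda-style handler can read each stream record, keep only INSERT events, and stage the new rows in S3 for a later Redshift COPY. The bucket name and record layout here are assumptions.

```python
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical staging bucket that Redshift would later COPY from.
STAGING_BUCKET = "my-redshift-staging"


def handler(event, context):
    """Triggered by a DynamoDB Stream; stages newly inserted rows."""
    new_rows = []
    for record in event["Records"]:
        # Only INSERT events are rows Redshift has not seen yet;
        # MODIFY and REMOVE would need their own handling.
        if record["eventName"] == "INSERT":
            # NewImage is the item in DynamoDB's typed JSON format,
            # e.g. {"id": {"S": "42"}, "total": {"N": "9.99"}}.
            new_rows.append(record["dynamodb"]["NewImage"])

    if new_rows:
        s3.put_object(
            Bucket=STAGING_BUCKET,
            Key=f"incremental/{context.aws_request_id}.json",
            Body="\n".join(json.dumps(r) for r in new_rows),
        )
```

A scheduled COPY from s3://my-redshift-staging/incremental/ (with a JSONPaths file to unwrap the typed attribute values) would then append only those staged rows.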




Answer 2:


Redshift's COPY command can only copy the entire DynamoDB table. There are several ways to achieve an incremental copy:

  1. Using an AWS EMR cluster and Hive - If you set up an EMR cluster, you can use Hive tables to run queries on the DynamoDB data and move the results to S3. That data can then be easily loaded into Redshift (a sketch follows this list).

  2. You can store your DynamoDB data in tables organized by access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way, for example one table per time period, a DynamoDB table can be dropped after it has been copied to Redshift.
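A sketch of option 1, with hypothetical table names, column mappings, bucket, and cluster ID: the HiveQL maps the DynamoDB table as an external Hive table and exports a filtered slice to S3, and boto3 submits it as a step to a running EMR cluster (assuming the script has first been uploaded to S3).

```python
import boto3

# Hypothetical HiveQL: maps the DynamoDB table into Hive, then exports
# one day's rows to S3. Table names and column mappings are assumptions.
EXPORT_HQL = """
CREATE EXTERNAL TABLE ddb_orders (id string, created_at string, total double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Orders",
  "dynamodb.column.mapping" = "id:id,created_at:created_at,total:total"
);

INSERT OVERWRITE DIRECTORY 's3://my-export-bucket/orders/2020-01-02/'
SELECT * FROM ddb_orders WHERE created_at >= '2020-01-02';
"""

emr = boto3.client("emr")

# Assumes EXPORT_HQL was uploaded to s3://my-export-bucket/export.hql and
# that an EMR cluster with Hive is running (the JobFlowId is a placeholder).
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "export-new-dynamodb-rows",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://my-export-bucket/export.hql"],
        },
    }],
)
```

The exported S3 prefix can then be loaded into Redshift with a plain COPY.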




Answer 3:


This can be solved with a secondary DynamoDB table that tracks only the keys that have changed since the last backup. This table has to be updated wherever the primary DynamoDB table is updated (add, update, delete). At the end of the backup process you delete the tracked keys, either all at once or one by one as each row is backed up.
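A minimal sketch of this pattern, assuming hypothetical table names (Orders with a string key id, plus OrdersChangedKeys as the tracking table): every application write touches both tables, and the backup job drains the tracking table one key at a time.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
main_table = dynamodb.Table("Orders")                 # assumed primary table
tracking_table = dynamodb.Table("OrdersChangedKeys")  # assumed key log


def write_item(item):
    """Application-side write: update the row and log its key."""
    main_table.put_item(Item=item)
    tracking_table.put_item(Item={"id": item["id"]})


def backup_changed_rows(export):
    """Backup job: export each changed row, then drop it from the log."""
    # Pagination via LastEvaluatedKey is omitted for brevity.
    scan = tracking_table.scan()
    for key in scan["Items"]:
        row = main_table.get_item(Key={"id": key["id"]}).get("Item")
        if row is not None:     # the row may have been deleted since
            export(row)         # e.g. stage to S3 for a Redshift COPY
        tracking_table.delete_item(Key={"id": key["id"]})
```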




Answer 4:


If your DynamoDB table has either

  1. a timestamp as an attribute, or

  2. a binary flag attribute which conveys data freshness,

then you can write a Hive query to export only the current day's (or otherwise fresh) data to S3, and then 'KEEP_EXISTING'-copy this incremental S3 data to Redshift.
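For context, KEEP_EXISTING is an insert mode of AWS Data Pipeline's RedshiftCopyActivity; the same behaviour can be reproduced by hand with a staging table, as in this sketch (endpoint, credentials, table, bucket, and IAM role are placeholders, and the export is assumed tab-delimited):

```python
import psycopg2

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="...",
)

with conn, conn.cursor() as cur:
    # Load the day's incremental export into an empty staging table.
    cur.execute("CREATE TEMP TABLE orders_staging (LIKE orders);")
    cur.execute("""
        COPY orders_staging
        FROM 's3://my-export-bucket/orders/2020-01-02/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        DELIMITER '\\t';
    """)
    # KEEP_EXISTING semantics: insert only rows whose key is not
    # already present in the target table.
    cur.execute("""
        INSERT INTO orders
        SELECT s.* FROM orders_staging s
        LEFT JOIN orders o ON o.id = s.id
        WHERE o.id IS NULL;
    """)
```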



Source: https://stackoverflow.com/questions/20980072/copying-only-new-records-from-aws-dynamodb-to-aws-redshift
