Database design for incremental “export” to data warehouse


Question


We have a 1 TB relational database, currently in SQL Server. The data warehouse needs a "copy" of the major parts of the database, and the warehouse data should be no more than 24 hours old. The size of the relational database makes a full load every night impractical. How should I design my relational database to support incremental loads to the warehouse?

A very small portion (<0.1%) of the database changes in a single day, and it is mostly inserts. The intra-day changes are not required, just the final snapshot.

Maintaining the performance of the relational database is the top concern, followed by avoiding wasted space.


Answer 1:


There are a number of ways to handle incremental pulls. Volumes have been written on the various methods and scenarios, so what I can give you here is one example approach.

  • For inserts, use a monotonically increasing key to track your high-water mark for each pull. Before pulling the data, check the target table for its maximum value, then pull from the source where the key is greater than that value (see the T-SQL sketch after this list).

  • For updates, base your incrementals on a "last-modified" timestamp. At the end of each batch, you'll want to note the latest timestamp and store it where you can pick it up for the next batch.

  • Deletes are harder to handle incrementally. I would recommend keeping a simple audit table per deletable table, where you track the key values of deleted rows. For each batch, pull based on the previous batch's high-water mark and then take the applicable action in your target system. In some cases, such as regulatory ("safe harbor") removals, you may need to physically delete rows from the target; in others you may simply mark the target record as inactive. It depends on the rules you've set up.
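
To make the three cases concrete, here is a minimal T-SQL sketch of the pattern. All object names (dbo.Orders, dbo.OrderDeleteAudit, the dw.LoadWatermark bookkeeping table) are hypothetical stand-ins, and the watermark row is assumed to have been written at the end of the previous batch:

    -- Source side: audit table plus trigger to record deleted keys.
    CREATE TABLE dbo.OrderDeleteAudit (
        OrderID   BIGINT    NOT NULL,
        DeletedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
    );
    GO

    CREATE TRIGGER trg_Orders_Delete ON dbo.Orders
    AFTER DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.OrderDeleteAudit (OrderID)
        SELECT OrderID FROM deleted;
    END;
    GO

    -- Nightly pull, driven by watermarks saved by the previous batch.
    DECLARE @LastKey BIGINT =
        (SELECT MaxKeyLoaded FROM dw.LoadWatermark WHERE TableName = 'Orders');
    DECLARE @LastRun DATETIME2 =
        (SELECT LastModifiedLoaded FROM dw.LoadWatermark WHERE TableName = 'Orders');

    -- 1. Inserts: rows beyond the highest key already loaded.
    SELECT * FROM dbo.Orders WHERE OrderID > @LastKey;

    -- 2. Updates: previously loaded rows modified since the last batch.
    SELECT * FROM dbo.Orders
    WHERE LastModified > @LastRun AND OrderID <= @LastKey;

    -- 3. Deletes: keys the trigger recorded since the last batch.
    SELECT OrderID FROM dbo.OrderDeleteAudit WHERE DeletedAt > @LastRun;

After a successful batch, write the new MAX(OrderID) and MAX(LastModified) values back to the watermark table so the next run picks up where this one left off.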

This is not the only way to do it, of course. But hopefully it provides you with some applicable context.




Answer 2:


Do you need to capture intra-day changes or do you just require a snapshot of the current state at the end of each day?

If a snapshot is acceptable then you could timestamp each row when it is updated so that you can identify the changes. If you need all the intra-day changes then look into some kind of change data capture (CDC) solution. Some DBMSs have CDC/logging features built-in and there are third-party tools that do the same job as well. Typically they will scrape the redo logs without accessing tables directly so as to minimise resource contention on the source system.
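
SQL Server's built-in change data capture reads the transaction log asynchronously, so the capture process never touches the source tables directly. A minimal sketch of enabling it and reading changes, assuming a hypothetical dbo.Orders table:

    -- Enable CDC for the database, then for the table of interest.
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',
        @role_name     = NULL;

    -- Read all changes in an LSN range via the generated change function;
    -- 'dbo_Orders' is the default capture-instance name for dbo.Orders.
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');

For a nightly snapshot you would persist the last LSN processed and start the next run from sys.fn_cdc_increment_lsn of that value.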




Answer 3:


This is a tricky area - commonly referred to as "Extract, Transform, Load" or ETL. There's no right answer, and none of the books I've found have been all that convincing - Ralph Kimball seems to write the most useful ones.

As a start, I'd suggest adding timestamp columns to your relational system; you could then run nightly queries that extract the rows changed since the last successful run. You might also create additional tables to store transfer status, so that every record in a source table has a corresponding record in the transfer table; if that record doesn't exist, the source record hasn't been transferred yet.
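
One way to realize the timestamp-column idea in SQL Server is a rowversion column as the change marker plus a transfer-status table that records the last value loaded per table (a trigger-maintained datetime2 "LastModified" column works similarly). All names below are hypothetical:

    -- Change marker, bumped automatically on every insert and update.
    ALTER TABLE dbo.Customers ADD RowVer ROWVERSION;

    -- One row per source table, recording the high-water mark transferred.
    CREATE TABLE etl.TransferStatus (
        TableName  SYSNAME   NOT NULL PRIMARY KEY,
        LastRowVer BINARY(8) NOT NULL
    );

    -- Nightly extract: everything changed since the last successful run.
    DECLARE @Last BINARY(8) =
        (SELECT LastRowVer FROM etl.TransferStatus WHERE TableName = 'Customers');

    SELECT * FROM dbo.Customers WHERE RowVer > @Last;

    -- After a successful load, advance the watermark.
    UPDATE etl.TransferStatus
    SET LastRowVer = (SELECT MAX(RowVer) FROM dbo.Customers)
    WHERE TableName = 'Customers';

Note that rowversion captures inserts and updates but not deletes, so you would still pair it with something like the audit-table approach from the first answer.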

If your transactional data model is heavily normalized, managing dependencies can be tricky: rows referenced by foreign keys must be migrated before the rows that reference them, which can lead to long dependency chains.

If performance suffers, you may need to look at running the ETL tasks against a mirror of your transactional database - though that adds a whole new layer of complexity.

I'd read the Kimball books first, and see if any ideas look directly applicable.



Source: https://stackoverflow.com/questions/5402141/database-design-for-incremental-export-to-data-warehouse
