Managing surrogate keys in a data warehouse

Question

I want to build a data warehouse, and I want to use surrogate keys as primary keys for my fact tables. The problem is that, in my case, the fact tables need to be updated.

My first question: how do I find the auto-generated surrogate key that corresponds to a natural key in the source system? I have seen some answers mention lookup tables that store the correspondence between natural and surrogate keys, but I didn't understand exactly how they are implemented. Where should this table be stored: in the data warehouse itself, or somewhere else?

There is also a second question. The source system already contains surrogate keys for facts, but they use the UUID data type, which is 16 bytes, and the number of facts is very unlikely to exceed the maximum value of a 4-byte integer. Should I use the UUIDs provided by the source system to simplify ETL, or should I build a more complex ETL and implement my own integer surrogate keys for better performance?

Answer 1:

I think the rest is answered already. I'll give you my two cents on managing and maintaining surrogate keys.

I worked with surrogate keys a lot during my time at Teradata. Here are a few best practices I learned over the years about surrogate keys.

Generate surrogate keys only from an approved master source (in your case, a particular API). Not many APIs should be allowed to generate the same domain values, so pick the one that acts as the master for your domain values. For example, Customer No usually comes from a CRM system as the master, and is unlikely to come from a billing system.
Generate and store these keys in a lookup table, as you suspected (let's call it Customer_SGK). Generally, these surrogate key tables are not part of your final LDM/PDM in either the Inmon or the Kimball approach. They reside on the same database server, but in a technical metadata schema. Let's call that schema "My_Tec_Schema".
In such a lookup table you would have the surrogate key column (e.g. Customer_ID), the source natural key column(s) for each master source (source1_customerNO, source2_customerNO), and a timestamp that records when the key was generated.
Your PK is Customer_ID, but its values may not be unique in this column, so depending on the data storage technology you may have to declare it as a unique or non-unique primary index / key (in Teradata, for instance, it would be a NUPI).
You sometimes have to allow this to ease your ETL processes, for example when loading the same Customer_ID for two different natural keys that come from two different source systems but mean the same customer.
With this lookup table in place, load it (that is, generate keys) from your stage tables / sources as the first step in your ETL process. Then load from stage with a left outer join to the lookup table to get the surrogate key, and load that into your fact table, ideally together with the natural keys. You always want to keep the natural keys, because most often you will get some orphans in your fact tables, and the only fast and reliable way to recover from that is to have the natural keys in the fact table and run an update driven by a PK, PI, or other index, which is much quicker than a full table scan.
You can always hide the natural keys in your fact table behind a presentation-layer view (a view used by consuming applications and users, while the table itself is kept for ETL purposes and technical people only).
Since you use an auto-number generation technique, you will have to pay special attention when migrating data from one environment to another, and also when migrating production data during a major release (you don't want key collisions).
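
To make the flow above concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Teradata. It assumes a single master source, so the lookup table has only one natural key column and the surrogate key is unique. The table names stage_sales, fact_sales and v_fact_sales, the load_batch helper, and the sample values are made up for illustration, and SQLite has no separate schemas, so everything sits in one in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup table; in a real warehouse it would live in a technical schema.
    CREATE TABLE customer_sgk (
        customer_id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        source1_customerno TEXT UNIQUE,                        -- natural key
        created_at         TEXT DEFAULT CURRENT_TIMESTAMP      -- audit trail
    );
    CREATE TABLE stage_sales (customer_no TEXT, amount REAL);
    -- The fact table keeps the natural key next to the surrogate key.
    CREATE TABLE fact_sales (customer_id INTEGER, customer_no TEXT, amount REAL);
    -- Presentation-layer view that hides the natural key from consumers.
    CREATE VIEW v_fact_sales AS SELECT customer_id, amount FROM fact_sales;
""")


def load_batch(conn, rows):
    """Stage a batch, generate any missing keys, then load the facts."""
    conn.execute("DELETE FROM stage_sales")
    conn.executemany("INSERT INTO stage_sales VALUES (?, ?)", rows)

    # Step 1: generate surrogate keys for natural keys not seen before.
    conn.execute("""
        INSERT INTO customer_sgk (source1_customerno)
        SELECT DISTINCT s.customer_no
        FROM stage_sales s
        LEFT JOIN customer_sgk k ON k.source1_customerno = s.customer_no
        WHERE k.customer_id IS NULL
    """)

    # Step 2: load facts, resolving the surrogate key via a left outer join.
    conn.execute("""
        INSERT INTO fact_sales (customer_id, customer_no, amount)
        SELECT k.customer_id, s.customer_no, s.amount
        FROM stage_sales s
        LEFT JOIN customer_sgk k ON k.source1_customerno = s.customer_no
    """)
    conn.commit()


# Two loads: C-100 appears in both and must resolve to the same surrogate key.
load_batch(conn, [("C-100", 10.0), ("C-200", 20.0)])
load_batch(conn, [("C-100", 5.0), ("C-300", 7.5)])
```

Because keys are generated before the fact load, the left outer join should never produce a NULL surrogate key here. If key generation were skipped or failed, the rows would still load with a NULL customer_id, and the natural key stored in fact_sales is what would let you repair them later.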

I can go on and on about surrogate keys. Having read this high-level overview, please ask any specific questions. I'd be glad to help.

Answer 2:

It looks like your question is: If I am generating a surrogate key in my data warehouse on the initial load of a row, how do I determine if a key has already been generated on subsequent loads? Should a lookup table be created and if so where would it be located?

Note: If at all possible, include the key from the source system in your data warehouse target table, even if you don't think you'll need it. It will prove invaluable for troubleshooting ETL issues.

Two straightforward options:

1. Perform the lookup directly against the target table (performance may be an issue on large tables).

2. Create an "ETL staging lookup" table that is used only by your ETL process (but is stored in your data warehouse). This is the more flexible option, but it adds a step to your ETL.
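
As a rough illustration of option 1, the sketch below looks up the source key directly in the target fact table and then either inserts a new row or updates the existing one. It uses Python's built-in sqlite3 module, and the table fact_order, its columns, and the upsert_fact helper are hypothetical names chosen for the example. The source key is stored as UUID text, matching the situation described in the question.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_order (
        order_sk  INTEGER PRIMARY KEY,   -- surrogate key assigned by the warehouse
        source_id TEXT NOT NULL UNIQUE,  -- key from the source system (a UUID here)
        amount    REAL NOT NULL
    );
""")


def upsert_fact(conn, source_id, amount):
    """Return the surrogate key for source_id, inserting or updating the row."""
    row = conn.execute(
        "SELECT order_sk FROM fact_order WHERE source_id = ?", (source_id,)
    ).fetchone()
    if row is None:
        # First time this source key is seen: a new surrogate key is generated.
        cur = conn.execute(
            "INSERT INTO fact_order (source_id, amount) VALUES (?, ?)",
            (source_id, amount),
        )
        return cur.lastrowid
    # Key already exists: update the fact in place via its surrogate key.
    conn.execute(
        "UPDATE fact_order SET amount = ? WHERE order_sk = ?", (amount, row[0])
    )
    return row[0]


src = str(uuid.uuid4())
first = upsert_fact(conn, src, 10.0)   # initial load
second = upsert_fact(conn, src, 12.5)  # later load of the same source row
conn.commit()
```

The UNIQUE constraint on source_id gives SQLite an index for the lookup. On a large fact table in another engine you would likely want an equivalent index, otherwise each lookup risks the performance issue mentioned in option 1.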

Source: https://stackoverflow.com/questions/47948372/managing-surrogate-keys-in-a-data-warehouse

Tags

database-design

etl

data-warehouse