Fact table with information that is regularly updatable in source system

问题

I'm building a dimensional data warehouse and learning how to model my various business processes from my source system in my warehouse.

I'm currently modelling a "Bid" (bid for work) from our source system in our data warehouse as a fact table which contains information such as:

Bid amount
Projected revenue
Sales employee
Bid status (active, pending, rejected, etc)
etc.

The problem is that the bid (or most any other process I'm trying to model) can go through various states and have its information updated at any given moment in the source system. According to Ralph Kimball, fact tables should only be updated if they are considered "accumulating snapshot" and I'm sure that not all of these processes would be considered an "accumulating snapshot" by the definition below.

How should these type of processes be modeled in the data warehouse according to the Kimball group's recommendations? Further more, what type of fact table would work for a bid (given the facts I've outlined above)?

Excert from http://www.kimballgroup.com/2008/11/fact-tables/

The transaction grain corresponds to a measurement taken at a single instant. The grocery store beep is a transaction grain. The measured facts are valid only for that instant and for that event. The next measurement event could happen one millisecond later or next month or never. Thus, transaction grain fact tables are unpredictably sparse or dense. We have no guarantee that all the possible foreign keys will be represented. Transaction grain fact tables can be enormous, with the largest containing many billions of records.

The periodic snapshot grain corresponds to a predefined span of time, often a financial reporting period. Figure 1 illustrates a monthly account periodic snapshot. The measured facts summarize activity during or at the end of the time span. The periodic snapshot grain carries a powerful guarantee that all of the reporting entities (such as the bank account in Figure 1) will appear in each snapshot, even if there is no activity. The periodic snapshot is predictably dense, and applications can rely on combinations of keys always being present. Periodic snapshot fact tables can also get large. A bank with 20 million accounts and a 10-year history would have 2.4 billion records in the monthly account periodic snapshot!

The accumulating snapshot fact table corresponds to a predictable process that has a well-defined beginning and end. Order processing, claims processing, service call resolution and college admissions are typical candidates. The grain of an accumulating snapshot for order processing, for example, is usually the line item on the order. Notice in Figure 1 that there are multiple dates representing the standard scenario that an order undergoes. Accumulating snapshot records are revisited and overwritten as the process progresses through its steps from beginning to end. Accumulating snapshot fact tables generally are much smaller than the other two types because of this overwriting strategy.

回答1:

Like one of the comments mention, Change Data Capture is a fairly generic term for "how do I handle changes to data entities over time", and there are entire books on it (and a gazillion posts and articles).

Regardless of any statements that seem to suggest a clear black-and-white or always-do-it-like-this answer, the real answer, as usual, is "it depends" - in your case, on what grain you need for your particular fact table.

If your data changes in unpredictable ways or very often, it can become challenging to implement Kimball's version of an accumulated snapshot (picture how many "milestone" date columns, etc. you might end up needing).

So, if you prefer, you can decide to make your fact table be an transactional fact table rather than a snapshot, where the fact key would be (Bid Key, Timestamp), and then in your application layer (whether a view, mview, actual app, or whatever), you can ensure that a given query only gets the latest version of each Bid (note that this can be thought of as kind of a virtual accumulated snapshot). If you find that you don't need the previous versions (the history of each Bid), you can have a routine that prunes them (i.e. deletes or moves them somewhere else).

Alternatively, you can only allow the fact (Bid) to be added when it is in it's final state, but then you will likely have a significant lag where a new (updateable) Bid doesn't make it to the fact table for some time.

Either way, there are several solid and proven techniques for handling this - you just have to clearly identify the business requirements and design accordingly.

Good luck!

来源：https://stackoverflow.com/questions/26351343/fact-table-with-information-that-is-regularly-updatable-in-source-system

标签

sql

data-warehouse

dimensional-modeling