Tuesday, July 23, 2013

Slowly Changing Dimensions Type X

I was recently able to attend the ETL Architecture class at Kimball University in Chicago.  It was truly fascinating, especially the discussion of the new types of slowly changing dimensions outlined in the new edition of the Data Warehouse Toolkit - there are eight of them now, numbered 0 through 7.  The most interesting part was the discussion of the changing nature of dimensions and how they should be considered on a per-attribute basis, not as a single quality of the entire record.  For example, all of our dimensions where I work are type II - meaning a new version of the entire record is created every time any attribute changes - even though certain attributes, like store open date, are static and never going to change (trust me on this one - I know there are a lot of scenarios where it could be considered open to change, but in our case it is immutable even if a store closes and reopens).

The eye-opener for me was the concept of slowly changing attributes, not entire dimensions.  Take, for example, a store table with attributes store_id, open_date and market_id.  store_id is our natural key, and we will add effective and end dates to round it out.
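A minimal sketch of what that dimension might look like (the store_sk surrogate key and the specific data types are assumptions on my part):

Create Table store_dim (
  store_sk        INTEGER PRIMARY KEY, -- surrogate key, one per generation
  store_id        INTEGER NOT NULL,    -- natural key, repeats across generations
  open_date       DATE,
  market_id       INTEGER,
  effective_date  DATE NOT NULL,       -- when this generation became active
  end_date        DATE                 -- NULL for the current generation
)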

Since the record is slowly changing type II, every time open_date or market_id changes a new record is created with the appropriate effective and end dates.  In our current implementation, that is the end of the story - and the end users are left dealing with the age-old current-versus-historical-versions problem whether they want to or not.

Treating this as a mixed-type record, however, we can define Store.market_id as a type II attribute and Store.open_date as a type I attribute.  The implication is that any time the open date changes it is an error correction, and therefore must be changed on every version of that dimensional record.  If market changes, that is something we care about capturing, so a new store record is created with the same store_id and different effective dates.
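To make the distinction concrete, here is a rough sketch of the two treatments in SQL (names follow the sketch above; the :parameter placeholders are illustrative):

-- Type I attribute (open_date): an error correction, applied to every generation
Update store_dim
Set open_date = :corrected_open_date
Where store_id = :store_id

-- Type II attribute (market_id): close out the current generation . . .
Update store_dim
Set end_date = :change_date
Where store_id = :store_id
  And end_date Is Null

-- . . . then insert the new generation
Insert Into store_dim
  (store_sk, store_id, open_date, market_id, effective_date, end_date)
Values
  (:new_store_sk, :store_id, :open_date, :new_market_id, :change_date, NULL)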

That is an elegant solution, but requires that IT decide (perhaps with the help of specific business users) which attributes are interesting historically and which are not.  Furthermore, the users are prone to change their minds, leaving you with either a problem (type I where type II is required) or crappy queries (type II where type I is desired).

What we have started doing internally, I think, answers both needs and allows our users to mix and match type I and type II attributes with a minimal increase in query complexity and a little extra ETL - which fits my M.O.: make the ETL do the hard stuff and rigorously enforce correctness, so that users of your schemas have a relatively straightforward job.  (By the way, this is where ETL unit testing is invaluable.)  The solution I found to this problem uses what I call Durable Dimension Keys.

Slowly Changing Dimensions with Durable Keys


To more completely demonstrate the issues our users are facing, consider how a business user might use a fact table tied to the store dimension:
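For concreteness, a minimal sketch of that fact table (names and types are again my assumptions):

Create Table sales_fact (
  store_sk   INTEGER NOT NULL,     -- the store generation active when the sale occurred
  sale_date  DATE NOT NULL,
  sales      DECIMAL(12,2) NOT NULL
)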

Our sales fact is keyed to a type II slowly changing dimension, which means that every store_sk on the sales fact corresponds to the date the fact occurred - e.g., a fact which occurred on 01-01-2013 carries the surrogate key of the store record that was active on 01-01-2013 (as indicated by its effective and end dates).

To illustrate the issues we have faced, let's define a business need to report sales by market (at Monkeys 'R Us that's pretty important).  Initially it's easy enough:

Select
  dim.market_id,
  SUM(fact.sales)
From
  sales_fact fact
Join
  store_dim dim
  On fact.store_sk = dim.store_sk
Group By
  dim.market_id

Implicit in this design is that the report provides total sales by historical market alignments - i.e., the market a store belonged to when the fact occurred.  That's the beauty of type II dimensions.

Our business analysts, however, just came into our office complaining that the Data Warehouse is misrepresenting sales as compared to our ODS.  After much consternation and head-scratching, we uncover that the business recently realigned our stores and markets and neglected to inform anyone in IT (I know, I know, never happens . . .).  We quickly knock out an updated report to show year-over-year sales by the current market alignments:

Select
  curr_dim.market_id,
  SUM(fact.sales)
From
  sales_fact fact
Join
  store_dim hist_dim
  On fact.store_sk = hist_dim.store_sk
Join
  store_dim curr_dim
  On curr_dim.store_id = hist_dim.store_id
  And curr_dim.end_date Is Null
Group By
  curr_dim.market_id

Sidebar:  our company standard is to use a null end date to indicate the current generation of an SCD-II dimension.

This is somewhat less than desirable, and it is going to make your queries slow.  Furthermore, you have reduced your nice clean star schema to a snowflake, with the dimension joining back to itself.  Adding another store reference to the fact makes this design cleaner, but requires a little trickery:
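A minimal sketch of the revised design, with the new key in both tables (column names and types are assumptions on my part):

Create Table store_dim (
  store_sk           INTEGER PRIMARY KEY,
  store_id           INTEGER NOT NULL,
  durable_store_key  INTEGER NOT NULL, -- positive on the current generation only
  open_date          DATE,
  market_id          INTEGER,
  effective_date     DATE NOT NULL,
  end_date           DATE              -- NULL for the current generation
)

Create Table sales_fact (
  store_sk           INTEGER NOT NULL, -- the generation active when the sale occurred
  durable_store_key  INTEGER NOT NULL, -- always the positive, current-generation key
  sale_date          DATE NOT NULL,
  sales              DECIMAL(12,2) NOT NULL
)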


Here I have added a durable store key, defined as a key unique to that dimension (e.g., a natural key, but always represented as an integer regardless of whether the natural key is compound or what its data types are) which is a positive integer on exactly one record - the current generation; every other generation of that object must carry a negative value for that key.  Note that this sign convention lives in the dimension, not the fact - the fact only ever contains the valid (positive) durable key for the dimension, alongside the surrogate key.  In this usage the current store dimension is really a role-playing dimension, since there are two keys in the fact pointing into the same dimension with different semantic meanings.

Every single fact is loaded the same way:  look up the store_sk corresponding to when the event occurred, then find the durable store key from the current generation, and record both.
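As a sketch, the two lookups in the fact load might look like this (the placeholders, and the inclusive/exclusive boundary semantics on the dates, are my assumptions):

-- Which generation was active when the event occurred?
Select store_sk
From store_dim
Where store_id = :store_id
  And effective_date <= :event_date
  And (end_date Is Null Or end_date > :event_date)

-- The durable key always comes from the current generation
Select durable_store_key
From store_dim
Where store_id = :store_id
  And end_date Is Null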

The dimension is more complex.  Consider store number 1 being loaded for the first time.  On the initial load, a record is created with a surrogate key of, say, 1000, a store_id of 1 and a durable store key of 1 (reusing the natural key in this case).  When store number 1 changes, that first record is updated to set the end date and to flip the durable store key to -1.  A new record is then inserted with a store_sk of, say, 1001, a store_id of 1 and a durable store key of 1.  As the generations proceed there will be exactly one record per surrogate key, potentially many records per natural key, and only one record per natural key with a positive durable key.
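A sketch of that generation rollover, using the illustrative names and values from above:

-- Retire the current generation: close it out and negate its durable key
Update store_dim
Set end_date = :change_date,
    durable_store_key = -durable_store_key
Where store_id = 1
  And end_date Is Null

-- Insert the new current generation carrying the positive durable key
Insert Into store_dim
  (store_sk, store_id, durable_store_key, open_date, market_id, effective_date, end_date)
Values
  (1001, 1, 1, :open_date, :new_market_id, :change_date, NULL)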

The purpose of all of this is to allow report consumers to decide which versions of each dimension attribute to use without (much) additional complexity - certainly in keeping with the goals of star schemas - and the fact record never has to be updated.  The historical reporting requirement mentioned above in the first example remains unchanged, and the second example can now be expressed as:

Select
  curr_dim.market_id,
  SUM(fact.sales)
From
  sales_fact fact
Join
  store_dim curr_dim
  On curr_dim.durable_store_key = fact.durable_store_key
Group By
  curr_dim.market_id

And the users who want to push the envelope and mix and match current and historical attributes can do so in the same way:

Select
  curr_dim.market_id,
  hist_dim.open_date,
  SUM(fact.sales)
From
  sales_fact fact
Join
  store_dim hist_dim
  On fact.store_sk = hist_dim.store_sk
Join
  store_dim curr_dim
  On fact.durable_store_key = curr_dim.durable_store_key
Group By
  curr_dim.market_id,
  hist_dim.open_date
This query looks very similar to the one I said was bad before, but now all joins are unidirectional - fact to dim - and there are no self-joins.  Any reporting tool will support querying against that schema and will produce simple queries.

Update:  I just read the chapter "Hybrid Slowly Changing Dimension Techniques" in Ralph Kimball & Margy Ross, The Data Warehouse Toolkit, Third Edition (July 2013), https://itun.es/us/-psMO.l, and what I describe here is very similar to what Kimball calls a Type 7 dimension, except that I require the durable key to be populated only on the current generation, and he uses views or some other mechanism to work it out as a logical role-playing dimension.  I prefer my approach with respect to the handling of the durable key, since it can easily be indexed and is straightforward to implement - and if a fact references the original dimension instead of the role-playing view, it still works.  I guess there truly is nothing new under the sun . . .

Update 2:  I have recently changed my approach to the durable key value: negative integers for historical records, and positive integers for the current generation.  This has the benefit of allowing us to put a unique index and a foreign key constraint on the key.
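A sketch of those constraints, assuming each retired generation carries a distinct negative value (for example the negated surrogate key rather than the negated natural key - that detail is an assumption on my part, since repeated -1 values would collide with a unique constraint):

-- Unique constraint on the durable key; historical rows must have distinct negatives
Alter Table store_dim
  Add Constraint uq_store_durable_key Unique (durable_store_key)

-- Facts always carry the positive, current-generation key
Alter Table sales_fact
  Add Constraint fk_sales_durable_key
  Foreign Key (durable_store_key) References store_dim (durable_store_key)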
