Code, Data and Other Things: 2013

Tuesday, December 17, 2013

Test Driven ETL Development

I did a screencast showing how my team uses Test Driven Development to develop ETL code. Here is the link:

Test Driven ETL Development

The presentation is a bit long because I discuss some of the theory of what we do and demonstrate the entire end-to-end process.

Friday, November 22, 2013

Siri, Punctuated

I have been a little tired of dictating text messages to Siri and having them come across bland - having no punctuation to speak of. I would ask "How are you?", and she would transcribe "How are you." On a hunch I tried speaking the appropriate punctuation and she picked it right up.

I tried:

Me: "Where are you. Where are you question mark. Where are you exclamation point."
Siri: "Where are you. Where are you? Where are you!"

Just for fun, I also did these:

Me: "Are you quote happy unquote question mark."
Siri: 'Are you "happy"?'

Me: "I picked up the kids semicolon they are very whiny."
Siri: "I picked up the kids; they are very whiny."

Me: "Are you colon happy comma unhappy comma or sad question mark."
Siri: "Are you: happy, unhappy, or sad?"

Anyway - pretty cool stuff.

This is on iOS 7 - not sure if it works elsewhere.

Maven assemblies with file permissions (modes)

I have this maven project which packages up some shell scripts and batch files in a plugin. On the client side, they get extracted into a local directory where they can be used for various functions. Long story short, I was a bit irritated that on windows the batch files are treated as executables, but on Mac OS X and Linux, they were simply text files, and had to be launched with:

sh etlunit

Granted this is not a big deal, but I was annoyed anyway. So, I Googled and found a directive to the maven assembly plugin which supports setting the file mode.

<fileSets>
<fileSet>
<directory>src/bin/</directory>
<outputDirectory>bin</outputDirectory>
<fileMode>0755</fileMode>
</fileSet>

</fileSets>

I tried that, and for some reason now my bin directory (created by the plugin) was inaccessible - it's mode had changed inadvertently - to 0000. On a hunch I added the directory mode as well:

<fileSets>
<fileSet>
<directory>src/bin/</directory>
<outputDirectory>bin</outputDirectory>
<fileMode>0755</fileMode>
<directoryMode>0755</directoryMode>
</fileSet>

</fileSets>

And now I feel much better about the universe. I can add the scripts to the path or use:

./etlunit

It's little victories like these that make the Open Source Software world so much fun . . .

Wednesday, October 9, 2013

Revenge, Justice, Mercy and Grace

[Disclaimer - this blog entry is about the Christian doctrine of grace]

I heard an illustration by Pastor Doyle today which contrasted these four ideas. I thought it was very powerful.

Suppose a man breaks into your home and brutally murders your only son. Your response can fall into these categories.

You hunt him down in anger and brutally murder him, inflicting the exact pain he caused your son. This is Revenge.

You hunt him down, but once apprehended you turn him over to the authorities for punishment. Once pronounced guilty, he is executed according to the law. This is Justice.

You hunt him down, this time you forgive him and leave him to live in his guilt and shame. This is Mercy.

As before, you hunt this man down, and having detained him you extend forgiveness, but this time you invite him into your home to live in your son's bedroom, take his place at your table, and become your legal heir. You love him as an adopted son and he is called by your name. This is Grace.

The point drives home for me when I think of myself as the brutal murderer. It's easy to think of oneself as the victim - the one extending forgiveness - but in God's sight we are all offenders differing only in degrees.

Romans 3:22-24

So, it is God who extends grace to us through his Son, Jesus - He is the only one who has the right to.

Saturday, October 5, 2013

iOS 7 music and screen locks

Found my first really annoying iOS 7 bug. If I have music playing and the phone locks, when I unlock the phone the music jumps or starts over.

Especially annoying when shopping!

Time Machine Saved My Behind, Twice.

For the second time in two years, the drive on my iMac died. The first one was a bad drive, I suspect the second was because I forgot to attach the heat sensor to the new drive.

Anyway, after a completely lost drive several years back I bought a Drobo and set up a RAID volume for TimeMachine - Apple's backup solution for Macs. So, when I woke up Tuesday morning to a clicking drive and a sad Mac, I only had to install a new drive, boot it into the recovery partition on my Drobo, and tell it to restore from the TimeMachine backup. After 50 hours (!) my OS, computer accounts, my wife's photo library and my iTunes library featuring hundreds of movies, etc., were all restored.

Pretty cool.

Friday, October 4, 2013

So, I thought IntelliJ IDEA was screwed up . . .

My maven projects all started behaving badly. No auto-completion, no auto-importing of classes, syntax highlighting wasn't working - I was getting really annoyed. My initial thought was that the 12.1.5 update released last week must have been the culprit.

I was about to open a support case with JetBrains, when I noticed that an option called "Power Save Mode" was enabled (File menu, towards the bottom). I turned that off, and presto! Everything came back to normal.

Bottom line - don't use this feature. It makes IDEA slightly better than a text editor - especially in comparison to what it normally does for you.

Tuesday, July 23, 2013

Slowly Changing Dimensions Type X

I was recently able to attend the ETL Architecture class at the Kimball University in Chicago. It was truly fascinating, especially the discussion about the new types of slowly changing dimensions outlined in the new edition of the Data Warehouse Toolkit - there are something like 9 of them. The most interesting part was the discussion about the changing nature of dimensions and how they should be considered on a per attribute basis, not as a single quality of the entire record. For example, all of our dimensions where I work are type II - meaning the entire record changes every time any attribute changes - even though certain attributes like store open date are static and not going to change (trust me on this one - I know there are a lot of scenarios where it could be considered open to change, but in our case it is immutable even if a store closes and reopens).

The eye-opener for me was the concept of slowly changing attributes, not the entire dimensions themselves. Take for example a store table with attributes store_id, open_date and market id. store_id is our natural key, and we will add effective and end dates to round it out - like so:

Since the record is slowly changing type II every time open date or market changes a new record is created with the appropriate effective and end dates. In our current implementation, that is the end of the story - and the end users are left dealing with the age-old current versus historical versions whether they want to or not.

Treating this as a mixed-type record, however, we can define Store.market as type II and Store.open_date as a type I attribute. The implications are that any time the open date changes it is an error correction and so therefore must be changed on every version of that dimensional record. If market changes, that is something we care about capturing so a new store record is created with the same store_number and different dates.

That is an elegant solution, but requires that IT decide (perhaps with the help of specific business users) which attributes are interesting historically and which are not. Furthermore, the users are prone to change their minds, leaving you with either a problem (type I where type II is required) or crappy queries (type II where type I is desired).

What we have started doing internally I think answers both needs and allows our users to mix and match type I and type II attributes with a minimal increase in query complexity and a little ETL - which fits my M/O, make the ETL do the hard stuff and rigorously enforce correctness so that users of your schemas have a relatively straightforward job. By the way, this is where ETL unit testing is invaluable. The solution I found to this problem uses what I call Durable Dimension Keys.

Slowly Changing Dimensions with Durable Keys

To more completely demonstrate the issues our users are facing, consider how a business user might utilize a fact table tied to the store dimension:

Our sales fact is keyed to a type II slowly changing dimension, which means that every store sk on the sales fact corresponds to the date the fact occurred. E.G., a fact which occurred on 01-01-2013 would include the surrogate key for the store record which was active on 01-01-2013 (indicated by the effective and end dates).

To illustrate the issues we have faced, let's define a business need to run sales by market (At Monkeys 'R Us that's pretty important). Initially it's easy enough:

Select

dim.Market ID,

SUM(fact.Sales)

From

Sales Fact fact

Join

Store Dim dim

fact.Store Sk = dim.Store Sk

Group By

dim.Market ID

Implicit in this design is that the report is providing a total of sales by historical market alignments - I.E., what market a store was considered in when the fact occurred. That's the beauty of type II dimensions.

Our business analysts however just came into our office complaining that the Data Warehouse is misrepresenting sales as compared to our ODS. After much consternation and head-scratching we uncover that the business recently realigned our stores and markets and neglected to inform anyone in IT (I know, I know, never happens . . .). We quickly scrap out an updated report to show year-over-year sales by the current market alignments:

Select
curr_dim.Market ID,
SUM(fact.Sales)
From
Sales Fact fact
Join
Store Dim hist_dim

fact.Store Sk = hist_dim.Store Sk

Join
Store Dim curr_dim

curr_dim.Store ID = hist_dim.Store ID

AND

curr_dim.end_date IS NULL

Group By

curr_dim.Market ID

Sidebar: our company standard is to use a null end date to indicate the current generation of an SCD-II dimension.

This is somewhat less than desirable, and going to make your queries slow. Furthermore, you have reduced your nice clean star schema into a snowflake with tables forming self joins. Adding another store reference makes this design cleaner but requires a little trickery:

Here I have added a durable store key, which is defined as a key which is unique to that dimension (E.G., a natural key, but always represented as an integer regardless of whether the natural key is a compound key or what the data types are) and can only be a positive integer on exactly one record - the current generation - every other generation of that object must have a negative value for that key. Note that this is in the dimension, not the fact - the fact only ever contains the valid durable key for the dimension and the surrogate key. In this usage the current store dimension is really a role-playing dimension since there are two keys in the fact pointing into the same dimension with different semantic meanings.

Every single fact is loaded the same way: Look up the store sk corresponding to when the event occurred, then find the durable store key from the current generation and record both.

The dimension is more complex. Let's consider when store number 1 is loaded for the first time. When the initial load happens, the record is created with an SK e.g. 1000, a store number of 1 and a durable store key of 1 (reusing the natural key in this case). When store number 1 changes, the first record is updated to set the end date, and also set the durable store key to -1. A new record is then inserted with store sk e.g. 1001, store number 1 and durable store key of 1. As the generations proceed there will be exactly one record with each sk, potentially many records with each natural key, and only one record per natural key with a positive durable key.

The purpose in all of this is to allow report consumers to decide which versions of each dimension attribute to use without (much) additional complexity - certainly in keeping with the goals of star schemas - and the fact record never has to be updated. The historical reporting requirement mentioned above in the first example remains unchanged, and the second example can now be expressed as:

Select
curr_dim.Market ID,
SUM(fact.Sales)
From
Sales Fact fact
Join

Store Dim curr_dim

curr_dim.Durable Store Key = fact.Durable Store Key

Group By

curr_dim.Market ID

And the users which want to push the envelope and mix and match current and historical attributes can continue in the same way:

Select
curr_dim.Market ID,
hist_dim.Open Date,
SUM(fact.Sales)
From
Sales Fact fact
Join
Store Dim hist_dim

fact.Store Sk = hist_dim.Store Sk

Join
Store Dim curr_dim

fact.Durable Store Key = curr_dim.Durable Store Key

Group By

curr_dim.Market ID

hist_dim.Open Date

This query looks very similar to the one I said was bad before, but now all joins are uni-directional - fact to dim - and there are no self-joins. Any reporting tool will support querying against that schema and produce simple queries.

Update: So, I just read the chapter "“Hybrid Slowly Changing Dimension Techniques” Excerpt From: Ralph Kimball & Margy Ross. “The Data Warehouse Toolkit.” iBooks. https://itun.es/us/-psMO.l", the Third edition (7/2013), and what I describe here is very similar to what Kimball calls a Type 7 dimension, except that I require the durable key to be populated on only the current generation, and he used views or some other mechanism to work it out as a logical role-playing dimension. I prefer my approach with respect to the handling of the durable key since it can easily be indexed and is straightforward to implement - and if referenced to the original dimension instead of the role-playing view it still works. I guess there truly is nothing new under the sun . . .

Update 2: I have recently changed my approach on the durable key value to use negative integers for historical records, and positive integers for the current generation. This has the benefit of allowing us to put a unique index and foreign key constraint on the key.

Friday, June 14, 2013

This is the first post on my sort-of technology blog. I am a Data Architect, with a background in object-oriented programming and database applications. I have also spent a lot of time writing networking applications in Java and C++ (the latter mostly in another life).

I currently own several open source projects on BitBucket - feel free to hop over there and use what interests you.

https://bitbucket.org/bradleysmithllc

My big project is etl-unit, a framework similar to junit but specifically targetted at testing ETL code in unfriendly (read closed and heavily proprietary) systems like Informatica, SSIS, Oracle PL/SQL, SQL Server, etc. These systems have interfaces available, but very little standardized with the exception of basic JDBC-type operations. Etl-unit seeks to make this work in a reliable, repeatable manner, and one that enables ETL teams to be effective.

I will probably post more about this and other things.