This is the first article in a series that will describe what slowly changing dimensions (SCD for short) are, how they work, and why you might need to take them into account in your database or datawarehouse solution.
The one truth about data modelling
As with many other areas, data modelling for BI solutions has a few internationally renowned bright stars that have, at some point, pretty much defined the language and methodology of the business. In “slowly changing dimensions”, most terms and solutions originally stem from Ralph Kimball. The reason I’m using the same terminology is because this is what you will encounter on the net and when speaking to other database professionals. Also, some data models, such as the star schema, work a lot better with existing BI tools than others data models.
But, by all means, these are not absolute truths. I want to provide them to you as a starting point to create great solutions of your own.
Dimensions vs facts
In practically all models used in data warehousing and data mining, the data is divided in two entities; dimensions and facts. Dimensions contain the properties (name, etc) of objects, whereas facts are the numbers and amounts that are used in aggregates or to calculate KPIs.
So, for instance, if your client is an airline company, some commonly used dimensions could be Aircraft, Airport, Route, Booking class, Ticketing agent, Staff, etc. Typical facts, on the other hand, would be Miles, Number of passengers, Fuel consumption, Ticket revenue, Crew salary, etc.
To put it differently, dimensions are the stuff you put in a GROUP BY clause, while facts are numeric columns and expressions that are aggregated.
All this is typically implemented in a star schema layout in the database, where a fact table sits in the middle, flanked by any number of dimension tables:
That said, SCDs can be used and work just as well with “regular” snowflake OLTP databases as with star schema data warehouses.
What is a slowly changing dimension?
As you’ll notice, a lot of dimensions will change over time. Airports change their names, aircraft are upgraded, etc. This is what is called slowly changing dimensions. Say, for instance, you have an aircraft with a seating capacity of 299 passengers. If you fly 285 passengers on that plane, your “cabin factor”, i.e. the percentage of seats you sold, would be 95.3%. But if you upgrade the plane and fit more seats in it, it would actually look like your cabin factor before the upgrade was lower than it actually was, so when you calculate the cabin factor, you need to keep track of the seating capacity before and after the upgrade.
Another example: Suppose you sell cars; the model C car is a small car, D is a mid-sized and E is a large car. But over time, with the new model B coming out, management wants to refer to the B model as a small car, C and D are mid-sized, while E is still a large car. How do you measure how many small cars you’ve sold over the last few quarters?
This is why it’s important to handle changes in dimension properties, and there are a few different ways (SCD models), depending on your needs.
What isn’t a slowly changing dimension?
The “slow” part of SCD comes from the fact that if a given property would change frequently (fast), it should probably have its own dimension. So, if you’re using the exact same aircraft on every flight, the aircraft would be a property of a route/flight dimension. But if aircraft changes on routes are frequent, you are better off creating a specific dimension called “aircraft”.
So when modelling your data, make sure you understand which properties change slowly and which ones are better off tied in with the fact data.
Different ways to manage SCDs
There are a few different approaches to handling SCDs. I’ll list them briefly here, and return to them with T-SQL examples in the following posts.
SCD type 0: Static dimensions
I actually had to google this one, because I really hadn’t heard of it previously. But you could say it’s a dimension that doesn’t change, even if the underlying data does. Once you’ve written a record to the dimension table, it stays that way. In fact, you probably have droves of these in your datawarehouse already, with static/manual values that don’t change. Examples would include dimensions like “event”, “type” – or “date” for that matter.
SCD type 1: Static, lastest version
Dimensions of type 1 do not keep any history of past properties. But they are updated whenever the underlying data updates, so the dimension will always reflect the lastest version. If you haven’t specifically designed an SCD method for your datawarehouse, this is probably how it will look.
SCD type 2: Versioned
Type 2 dimensions create new records in the dimension table for each change in any property. This provides a complete history of all dimension versions. This is by far my favourite SCD model.
SCD type 3: Limited history
This model provides a very limited history and presents an increased complexity when writing queries for historic values, but it limits the size of the dimension table to one record per member.
In my personal opinion, SCD type 3 is not a good option if you need to keep track of changing dimensions, because very few changing dimensions only change once in the lifetime of a data warehouse. Also, the resulting table may become very wide with all the change columns introduced.
The best use of SCD type 3 is when you want to introduce a one-time change to an SCD type 0/1 dimension, knowing that only one or a few fields will change, for instance, with a change of ERP system, a corporate takeover of your organisation, or similar.
SCD type 4: History tables
SCD type 4 does not really provide a “true SCD” functionality in my view. It works more like an audit trail, where overwritten dimension records are moved to an audit/history table. This approach is more common in OLTP databases, mostly because you can target more specific tables (with fewer columns) for this kind of auditing.
The advantages of type 4 SCDs are that the dimension table still only holds one record per dimension member, regardless of how many versions exist, and you don’t have to resort to ugly extra history columns like SCD 3. The drawback is the obvious difficulty in modelling and writing queries on historic dimension properties.
There are a few hybrids based on the above types, but I won’t go into them, because chances are you a) won’t encounter them, and b) you should probably choose a better model.
More on slowly changing dimensions
Stay tuned for the next part of this post, where we’ll look at practical T-SQL implementations of SCD.