In the third installment of the series on slowly changing dimensions, we’re going to tackle the question of how to manage accumulated fact aggregates in a solution that uses SCD 2 dimensions. While SCD 2 dimensions solve a lot of problems with slowly changing dimensions, accumulated values can still make a mess of the aggregate data.
The problem
When working with fact data that is accumulated over time, such as sales or accounting, along with SCD 2 dimensions, the results of the accumulation will vary if the aggregate is grouped by a property in the SCD 2 dimension that changes.
An accounting example
Here’s an example of a fact table and an SCD 2 dimension for a car manufacturer.
- There are three different car models, C, D and E, classified as small, medium and large respectively.
- As of 2013, a new model, B, is added.
- At the same time, the C model changes class from small to medium.
- In march 2013, the B model also changes size classification from small to medium-sized.
Here is the example setup:
--- SCD 2 dimension for models: CREATE TABLE SCD.model ( [ID] int NOT NULL, modelCode varchar(4) NOT NULL, sizeClass varchar(20) NOT NULL, fromDate date NOT NULL, toDate date NULL, CONSTRAINT PK_models PRIMARY KEY CLUSTERED ([ID]) ); INSERT INTO SCD.model ([ID], modelCode, sizeClass, fromDate, toDate) VALUES (101, 'C', 'Small car', {d '1990-01-01'}, {d '2012-12-31'}), (102, 'D', 'Medium-sized car', {d '1990-01-01'}, {d '2012-12-31'}), (103, 'E', 'Large car', {d '1990-01-01'}, {d '2012-12-31'}), (104, 'B', 'Small car', {d '2013-01-01'}, {d '2013-02-28'}), (105, 'C', 'Medium-sized car', {d '2013-01-01'}, {d '2013-02-28'}), (106, 'D', 'Medium-sized car', {d '2013-01-01'}, {d '2013-02-28'}), (107, 'E', 'Large car', {d '2013-01-01'}, {d '2013-02-28'}), (108, 'B', 'Medium-sized car', {d '2013-03-01'}, NULL), (109, 'C', 'Medium-sized car', {d '2013-03-01'}, NULL), (110, 'D', 'Medium-sized car', {d '2013-03-01'}, NULL), (111, 'E', 'Large car', {d '2013-03-01'}, NULL); --- Fact table for model sales: CREATE TABLE SCD.modelSales ( [ID] int IDENTITY(1, 1) NOT NULL, [date] date NOT NULL, model_ID int NOT NULL, number int NOT NULL, salesAmount numeric(10, 2) NOT NULL, isManual bit NOT NULL, CONSTRAINT PK_modelSales PRIMARY KEY CLUSTERED ([ID]), CONSTRAINT FK_modelSales_model FOREIGN KEY (model_ID) REFERENCES SCD.model ([ID]) ); INSERT INTO SCD.modelSales ([date], model_ID, number, salesAmount, isManual) VALUES ({d '2012-11-01'}, 101, 120, 1200000, 0), ({d '2012-11-01'}, 102, 112, 1120000, 0), ({d '2012-11-01'}, 103, 97, 970000, 0), ({d '2012-12-01'}, 101, 130, 1320000, 0), ({d '2012-12-01'}, 102, 125, 1280000, 0), ({d '2012-12-01'}, 103, 118, 1210000, 0), ({d '2012-01-01'}, 104, 115, 1280000, 0), ({d '2012-01-01'}, 105, 62, 790000, 0), ({d '2012-01-01'}, 106, 122, 1370000, 0), ({d '2012-01-01'}, 107, 127, 1390000, 0), ({d '2012-02-01'}, 104, 99, 1120000, 0), ({d '2012-02-01'}, 105, 87, 980000, 0), ({d '2012-02-01'}, 106, 140, 1940000, 0), ({d '2012-02-01'}, 107, 121, 1810000, 0), ({d '2013-03-01'}, 108, 91, 1090000, 0), ({d '2013-03-01'}, 109, 86, 970000, 0), ({d '2013-03-01'}, 110, 148, 1990000, 0), ({d '2013-03-01'}, 111, 118, 1710000, 0);
With this data, let’s take a look at how many small, medium-sized and large cars have been sold as of february 2013:
SELECT dim.sizeClass, SUM(fact.number) AS number, SUM(fact.salesAmount) AS salesAmount FROM SCD.modelSales AS fact INNER JOIN SCD.model AS dim ON dim.[ID]=fact.model_ID WHERE fact.[date]<={d '2013-02-01'} GROUP BY dim.sizeClass ORDER BY dim.sizeClass DESC;
By this way to calculate the totals, this aggregate will show the following sales:
- 464 small cars
- 648 medium-sized cars
- 463 large cars
But with the model classifications at the time (february 2013) of the different models, 250 of the 464 small cars are C models, which were classified as small cars at the time of sale, but are now considered medium-sized. So, by the reporting date, only 214 cars should be considered “small cars”.
SCD 1 vs SCD 2?
But if the classifications of the sold cars are determined by the most recent version, aren’t we just talking about a regular SCD 1 dimension? The answer is no, because there are also dimension versions after february 2013. What we’re looking for is how the dimension properties looked on february 2013, but we want to include all the sales up to that date, even if the classification may have been different at the time of the sale.
So, if we used SCD1, we wouldn’t know what the size categories were in february 2013, because it would have been overwritten by newer classifications.
A solution
To fix this problem, we’re can shift the sales balances and “rebrand” them to newer SCD 2 members (with the same natural key, but with a newer date). Here’s how it works:
By eliminating (subtracting) the 250 sold cars on january 1, then re-adding the same number of cars, we can shift the sales from dimension ID 101 to ID 105. This way, the accumulated sales for the C model are the same, but we’ve shifted the sales of model C cars from “small cars” to “mid-sized cars” on january 1.
This way, if you summarize sales as of december 31, you will see 250 small model C cars sold, but as of 2013, they are all medium-sized.
How to create the fact change records
To create these change records, we can use a query that performs the following steps:
- removes any previous change records,
- identify changes in the SCD 2 dimension,
- aggregate the accumulated amount for each dimension member up to the change date,
- create two new bookings on that date.
--- Delete existing change records DELETE FROM SCD.modelSales WHERE isManual=1; WITH versions ([ID], modelCode, sizeClass, ordinal, fromDate) AS ( --- Add version numbers to each version of the SCD 2 model dimension: SELECT [ID], modelCode, sizeClass, ROW_NUMBER() OVER ( PARTITION BY modelCode ORDER BY fromDate) AS ordinal, fromDate FROM SCD.model), versionChanges (fromID, toID, fromDate) AS ( --- Using the versioned view, isolate rows where dimension --- properties change from one version to the next: SELECT a.[ID] AS fromID, b.[ID] AS toID, b.fromDate FROM versions AS a INNER JOIN versions AS b ON a.modelCode=b.modelCode AND a.ordinal+1=b.ordinal WHERE a.sizeClass!=b.sizeClass), factRows ([date], fromID, toID, number, salesAmount) AS ( --- Calculate fact rows to be inserted: SELECT chg.fromDate AS [date], chg.fromID, chg.toID, SUM(fact.number) AS number, SUM(fact.salesAmount) AS salesAmount FROM versionChanges AS chg INNER JOIN SCD.modelSales AS fact ON chg.fromID=fact.model_ID AND chg.fromDate>=fact.[date] GROUP BY chg.fromDate, chg.fromID, chg.toID) --- Add two fact rows, one that eliminates the old balance... INSERT INTO SCD.modelSales ([date], model_ID, number, salesAmount, isManual) SELECT [date], fromID AS model_ID, 0-number, 0.0-salesAmount, 1 AS isManual FROM factRows UNION ALL --- ... and another one that adds the same balance to the new --- SCD 2 dimension key: SELECT [date], toID AS model_ID, number, salesAmount, 1 AS isManual FROM factRows;