Slowly changing dimensions (part 3)

In the third installment of the series on slowly changing dimensions, we’re going to tackle the question of how to manage accumulated fact aggregates in a solution that uses SCD 2 dimensions. While SCD 2 dimensions solve a lot of problems with slowly changing dimensions, accumulated values can still make a mess of the aggregate data.

The problem

When working with fact data that is accumulated over time, such as sales or accounting, along with SCD 2 dimensions, the results of the accumulation will vary if the aggregate is grouped by a property in the SCD 2 dimension that changes.

An accounting example

Here’s an example of a fact table and an SCD 2 dimension for a car manufacturer.

  • There are three different car models, C, D and E, classified as small, medium and large respectively.
  • As of 2013, a new model, B, is added.
  • At the same time, the C model changes class from small to medium.
  • In march 2013, the B model also changes size classification from small to medium-sized.

Here is the example setup:

--- SCD 2 dimension for models:
CREATE TABLE SCD.model (
    [ID]            int NOT NULL,
    modelCode       varchar(4) NOT NULL,
    sizeClass       varchar(20) NOT NULL,
    fromDate        date NOT NULL,
    toDate          date NULL,
    CONSTRAINT PK_models PRIMARY KEY CLUSTERED ([ID])
);

INSERT INTO SCD.model ([ID], modelCode, sizeClass, fromDate, toDate)
VALUES (101, 'C', 'Small car',        {d '1990-01-01'}, {d '2012-12-31'}),
       (102, 'D', 'Medium-sized car', {d '1990-01-01'}, {d '2012-12-31'}),
       (103, 'E', 'Large car',        {d '1990-01-01'}, {d '2012-12-31'}),
       (104, 'B', 'Small car',        {d '2013-01-01'}, {d '2013-02-28'}),
       (105, 'C', 'Medium-sized car', {d '2013-01-01'}, {d '2013-02-28'}),
       (106, 'D', 'Medium-sized car', {d '2013-01-01'}, {d '2013-02-28'}),
       (107, 'E', 'Large car',        {d '2013-01-01'}, {d '2013-02-28'}),
       (108, 'B', 'Medium-sized car', {d '2013-03-01'}, NULL),
       (109, 'C', 'Medium-sized car', {d '2013-03-01'}, NULL),
       (110, 'D', 'Medium-sized car', {d '2013-03-01'}, NULL),
       (111, 'E', 'Large car',        {d '2013-03-01'}, NULL);

--- Fact table for model sales:
CREATE TABLE SCD.modelSales (
    [ID]            int IDENTITY(1, 1) NOT NULL,
    [date]          date NOT NULL,
    model_ID        int NOT NULL,
    number          int NOT NULL,
    salesAmount     numeric(10, 2) NOT NULL,
    isManual        bit NOT NULL,
    CONSTRAINT PK_modelSales PRIMARY KEY CLUSTERED ([ID]),
    CONSTRAINT FK_modelSales_model
        FOREIGN KEY (model_ID) REFERENCES SCD.model ([ID])
);

INSERT INTO SCD.modelSales ([date], model_ID, number, salesAmount, isManual)
VALUES ({d '2012-11-01'}, 101,  120, 1200000, 0),
       ({d '2012-11-01'}, 102,  112, 1120000, 0),
       ({d '2012-11-01'}, 103,   97,  970000, 0),
       ({d '2012-12-01'}, 101,  130, 1320000, 0),
       ({d '2012-12-01'}, 102,  125, 1280000, 0),
       ({d '2012-12-01'}, 103,  118, 1210000, 0),
       ({d '2012-01-01'}, 104,  115, 1280000, 0),
       ({d '2012-01-01'}, 105,   62,  790000, 0),
       ({d '2012-01-01'}, 106,  122, 1370000, 0),
       ({d '2012-01-01'}, 107,  127, 1390000, 0),
       ({d '2012-02-01'}, 104,   99, 1120000, 0),
       ({d '2012-02-01'}, 105,   87,  980000, 0),
       ({d '2012-02-01'}, 106,  140, 1940000, 0),
       ({d '2012-02-01'}, 107,  121, 1810000, 0),
       ({d '2013-03-01'}, 108,   91, 1090000, 0),
       ({d '2013-03-01'}, 109,   86,  970000, 0),
       ({d '2013-03-01'}, 110,  148, 1990000, 0),
       ({d '2013-03-01'}, 111,  118, 1710000, 0);

With this data, let’s take a look at how many small, medium-sized and large cars have been sold as of february 2013:

SELECT dim.sizeClass, SUM(fact.number) AS number,
       SUM(fact.salesAmount) AS salesAmount
FROM SCD.modelSales AS fact
INNER JOIN SCD.model AS dim ON
    dim.[ID]=fact.model_ID
WHERE fact.[date]<={d '2013-02-01'}
GROUP BY dim.sizeClass
ORDER BY dim.sizeClass DESC;

By this way to calculate the totals, this aggregate will show the following sales:

  • 464 small cars
  • 648 medium-sized cars
  • 463 large cars

But with the model classifications at the time (february 2013) of the different models, 250 of the 464 small cars are C models, which were classified as small cars at the time of sale, but are now considered medium-sized. So, by the reporting date, only 214 cars should be considered “small cars”.

SCD 1 vs SCD 2?

But if the classifications of the sold cars are determined by the most recent version, aren’t we just talking about a regular SCD 1 dimension? The answer is no, because there are also dimension versions after february 2013. What we’re looking for is how the dimension properties looked on february 2013, but we want to include all the sales up to that date, even if the classification may have been different at the time of the sale.

So, if we used SCD1, we wouldn’t know what the size categories were in february 2013, because it would have been overwritten by newer classifications.

A solution

To fix this problem, we’re can shift the sales balances and “rebrand” them to newer SCD 2 members (with the same natural key, but with a newer date). Here’s how it works:

SCD 2 fact rebooking example

By eliminating (subtracting) the 250 sold cars on january 1, then re-adding the same number of cars, we can shift the sales from dimension ID 101 to ID 105. This way, the accumulated sales for the C model are the same, but we’ve shifted the sales of model C cars from “small cars” to “mid-sized cars” on january 1.

This way, if you summarize sales as of december 31, you will see 250 small model C cars sold, but as of 2013, they are all medium-sized.

How to create the fact change records

To create these change records, we can use a query that performs the following steps:

  • removes any previous change records,
  • identify changes in the SCD 2 dimension,
  • aggregate the accumulated amount for each dimension member up to the change date,
  • create two new bookings on that date.
--- Delete existing change records
DELETE FROM SCD.modelSales WHERE isManual=1;

WITH versions ([ID], modelCode, sizeClass, ordinal, fromDate)
AS (
    --- Add version numbers to each version of the SCD 2 model dimension:
    SELECT [ID], modelCode, sizeClass, ROW_NUMBER() OVER (
        PARTITION BY modelCode
        ORDER BY fromDate) AS ordinal, fromDate
    FROM SCD.model),

     versionChanges (fromID, toID, fromDate)
AS (
    --- Using the versioned view, isolate rows where dimension
    --- properties change from one version to the next:
    SELECT a.[ID] AS fromID, b.[ID] AS toID, b.fromDate
    FROM versions AS a
    INNER JOIN versions AS b ON
        a.modelCode=b.modelCode AND
        a.ordinal+1=b.ordinal
    WHERE a.sizeClass!=b.sizeClass),

     factRows ([date], fromID, toID, number, salesAmount)
AS (
    --- Calculate fact rows to be inserted:
    SELECT chg.fromDate AS [date], chg.fromID, chg.toID,
        SUM(fact.number) AS number, SUM(fact.salesAmount) AS salesAmount
    FROM versionChanges AS chg
    INNER JOIN SCD.modelSales AS fact ON
        chg.fromID=fact.model_ID AND
        chg.fromDate>=fact.[date]
    GROUP BY chg.fromDate, chg.fromID, chg.toID)

--- Add two fact rows, one that eliminates the old balance...
INSERT INTO SCD.modelSales ([date], model_ID, number, salesAmount, isManual)
SELECT [date], fromID AS model_ID, 0-number, 0.0-salesAmount, 1 AS isManual
FROM factRows
UNION ALL
--- ... and another one that adds the same balance to the new
--- SCD 2 dimension key:
SELECT [date], toID   AS model_ID,   number,     salesAmount, 1 AS isManual
FROM factRows;

Let me hear your thoughts!

This site uses Akismet to reduce spam. Learn how your comment data is processed.