Efficient data, part 6: Versioning changes

This installment in the series on efficient data is on versioning changes in a table. The article is a re-post of a post I wrote in september on compressing slowly changing dimensions, although the concept does not only apply to dimensions – it can be used pretty much on any data that changes over time.

The idea is to “compress” a versioned table, so instead of just adding a date column for each version, you can compress multiple, sequential versions into a single row with a “from” date and a “to” date. This can significantly compress the size of the table.

The setup

Here’s the sample data we’re going to be working with. It’s an imaginary airline company where all the data is delivered in a raw text file that is dumped into a staging table.

CREATE SCHEMA SCD;
GO
CREATE TABLE SCD.rawData (
    [date]           date NOT NULL,
    flightNo         smallint NOT NULL,
    registration     varchar(10) NOT NULL,
    manufacturer     varchar(20) NOT NULL,
    model            varchar(20) NOT NULL,
    seatsInBusiness  smallint NOT NULL,
    seatsInEconomy   smallint NOT NULL,
    fromAirport      char(3) NOT NULL,
    toAirport        char(3) NOT NULL,
    paxInBusiness    smallint NOT NULL,
    paxInEconomy     smallint NOT NULL,
    miles            smallint NOT NULL,
    CONSTRAINT PK_rawData
        PRIMARY KEY CLUSTERED ([date], flightNo)
);
--- .. and some sample data:
INSERT INTO SCD.rawData ([date], flightNo, registration, manufacturer, model,
    seatsInBusiness, seatsInEconomy, fromAirport, toAirport,
    paxInBusiness, paxInEconomy, miles)
VALUES ({d '2013-09-03'}, 201, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'ARN', 'JFK', 14, 250, 3910),
       ({d '2013-09-03'}, 202, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'JFK', 'ARN', 12, 249, 3910),
       ({d '2013-09-04'}, 201, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'ARN', 'JFK', 22, 261, 3910),
       ({d '2013-09-04'}, 202, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'JFK', 'ARN', 19, 252, 3910),
       ({d '2013-09-05'}, 201, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'ARN', 'JFK', 18, 251, 3910),
       ({d '2013-09-05'}, 202, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'JFK', 'ARN', 24, 263, 3910),
       ({d '2013-09-06'}, 201, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'ARN', 'JFK', 31, 261, 3910),
       ({d '2013-09-06'}, 202, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'JFK', 'ARN', 29, 221, 3910),
       ({d '2013-09-09'}, 201, 'XY1234', 'Airbus', 'A330-300 (v2)', 42, 263, 'ARN', 'JFK', 34, 263, 3910),
       ({d '2013-09-09'}, 202, 'XY1234', 'Airbus', 'A330-300 (v2)', 42, 263, 'JFK', 'ARN', 38, 258, 3910),
       ({d '2013-09-10'}, 201, 'XY1234', 'Airbus', 'A330-300 (v2)', 42, 263, 'ARN', 'JFK', 35, 260, 3910),
       ({d '2013-09-10'}, 202, 'XY1234', 'Airbus', 'A330-300 (v2)', 42, 263, 'JFK', 'ARN', 34, 261, 3910);

In addition to this, we’re going to set up a fact table and two dimensions:

--- Static (SCD 0) table with airports:
CREATE TABLE SCD.airport (
    [ID]               int IDENTITY(101, 1) NOT NULL,
    airportCode        char(3) NOT NULL,
    displayName        varchar(100) NOT NULL,
    CONSTRAINT PK_airport
        PRIMARY KEY CLUSTERED ([ID])
);

INSERT INTO SCD.airport (airportCode, displayName)
VALUES ('ARN', 'Stockholm Arlanda'),
       ('JFK', 'New York City JFK');

--- SCD 2 table with aircraft dimension:
CREATE TABLE SCD.aircraft (
    [ID]               int IDENTITY(101, 1) NOT NULL,
    registration       varchar(10) NOT NULL,
    fromDate           date NOT NULL,
    toDate             date NULL,
    manufacturer       varchar(20) NOT NULL,
    model              varchar(20) NOT NULL,
    seatsInBusiness    smallint NOT NULL,
    seatsInEconomy     smallint NOT NULL,
    CONSTRAINT PK_aircraft
        PRIMARY KEY CLUSTERED ([ID])
);

--- Fact table with all flights:
CREATE TABLE SCD.flight (
    aircraft_ID        int NOT NULL,
    [date]             date NOT NULL,
    flightNo           smallint NOT NULL,
    fromAirport_ID     int NOT NULL,
    toAirport_ID       int NOT NULL,
    paxInBusiness      smallint NOT NULL,
    paxInEconomy       smallint NOT NULL,
    miles              smallint NOT NULL,
    CONSTRAINT PK_flight
        PRIMARY KEY CLUSTERED ([date], flightNo),
    CONSTRAINT FK_flight_aircraft
        FOREIGN KEY (aircraft_ID) REFERENCES SCD.aircraft ([ID]),
    CONSTRAINT FK_flight_fromAirport
        FOREIGN KEY (fromAirport_ID) REFERENCES SCD.airport ([ID]),
    CONSTRAINT FK_flight_toAirport
        FOREIGN KEY (toAirport_ID) REFERENCES SCD.airport ([ID])
);

Ordering the dimension data

The first thing we’re going to do is to order the dimension data by date, using the ROW_NUMBER() windowed function. The dimension’s natural key is the aircraft’s registration number, so this is what we’re going to put in PARTITION BY. The ordering column is the date, so that goes into ORDER BY.

Here’s how it will look:

WITH versions AS (
    SELECT registration, [date], manufacturer, model, seatsInBusiness, seatsInEconomy,
        ROW_NUMBER() OVER (
            PARTITION BY registration
            ORDER BY [date], flightNo) AS ordinal
    FROM SCD.rawData)

SELECT *
FROM versions AS a
ORDER BY a.registration, a.ordinal;

Note how we get one record per plane and date.

Identifying change records

In order to keep the dimension to a reasonable size, we really want to identify just those records that contain changes. In order to do that, we can use the common table expression above to join itself like this:

FROM versions AS a
RIGHT JOIN versions AS b ON
    a.registration=b.registration AND
    a.ordinal+1=b.ordinal

… so the “a” table contains the dimension properties as they were before the change, and “b” as they were after, respectively. This is accomplished by joining on “a.ordinal+1=b.ordinal”.

The reson for the RIGHT JOIN is that we don’t just want the changes from one record to the next, but we also want the first version of each member, i.e. where “a” (the previous version) does not exist and b.ordinal=1.

To do this, we’ll apply a WHERE clause that isolates records where b.ordinal=1 or where there’s a change in any of the property columns. Finally, because we’ve filtered out a lot of rows, we’ll need to recalculate the ordinal number, so it’s “dense” once again (see column “denseOrdinal”).

WITH versions AS (
    SELECT registration, [date], manufacturer, model, seatsInBusiness,
        seatsInEconomy, ROW_NUMBER() OVER (
            PARTITION BY registration
            ORDER BY [date], flightNo) AS ordinal
    FROM SCD.rawData)

SELECT b.registration, b.ordinal, b.[date],
    --- Calculate a new, dense, ordinal number:
    ROW_NUMBER() OVER (
        PARTITION BY b.registration
        ORDER BY b.ordinal) AS denseOrdinal
FROM versions AS a
RIGHT JOIN versions AS b ON
    a.registration=b.registration AND
    a.ordinal+1=b.ordinal
WHERE --- The first row:
      b.ordinal=1 OR
      --- ... or rows where something changes
      --- from one date to the next:
      a.manufacturer!=b.manufacturer OR
      a.model!=b.model OR
      a.seatsInBusiness!=b.seatsInBusiness OR
      a.seatsInEconomy!=b.seatsInEconomy;

MERGE’ing into a dimension table

All that remains is to merge this data into the dimension table. Here is the complete query:

WITH -- Step 1: Assigning ordinal numbers to the different
    ---         versions of the dimension members
     versions AS (
    SELECT registration, [date], manufacturer, model, seatsInBusiness,
        seatsInEconomy, ROW_NUMBER() OVER (
            PARTITION BY registration
            ORDER BY [date], flightNo) AS ordinal
    FROM SCD.rawData),

    --- Step 2: Filtering out just the change records for
    ---         the dimensions:
     distinctVersions AS (
        SELECT b.registration, b.ordinal, b.[date],
            --- Calculate a new, dense, ordinal number:
            ROW_NUMBER() OVER (
                PARTITION BY b.registration
                ORDER BY b.ordinal) AS denseOrdinal
        FROM versions AS a
        RIGHT JOIN versions AS b ON
            a.registration=b.registration AND
            a.ordinal+1=b.ordinal
        WHERE --- The first row:
              b.ordinal=1 OR
              --- ... or rows where something changes
              --- from one date to the next:
              a.manufacturer!=b.manufacturer OR
              a.model!=b.model OR
              a.seatsInBusiness!=b.seatsInBusiness OR
              a.seatsInEconomy!=b.seatsInEconomy),

    --- Step 3: Join the recordset in "distinctVersions" to "versions" to
    ---         obtain all the properties for each version. In order to
    ---         calculate the "toDate" column, "d2" is also joined.
     final AS (
        SELECT v1.registration, d1.[date] AS fromDate,
               d2.[date] AS toDate, v1.manufacturer,
               v1.model, v1.seatsInBusiness, v1.seatsInEconomy
        FROM distinctVersions AS d1
        LEFT JOIN distinctVersions AS d2 ON
            d1.registration=d2.registration AND
            d1.denseOrdinal+1=d2.denseOrdinal
        INNER JOIN versions AS v1 ON
            d1.registration=v1.registration AND
            d1.ordinal=v1.ordinal)

--- MERGE the resulting data into the dimension table:
MERGE INTO SCD.aircraft
USING final ON aircraft.registration=final.registration AND
               aircraft.fromDate=final.fromDate

--- Records that don't exist any more (or where fromDate has
--- changed, for that matter)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE

--- Existing records, to be updated:
WHEN MATCHED THEN
    UPDATE SET aircraft.toDate=final.toDate,
               aircraft.manufacturer=final.manufacturer,
               aircraft.model=final.model,
               aircraft.seatsInBusiness=final.seatsInBusiness,
               aircraft.seatsInEconomy=final.seatsInEconomy

--- New records, to be inserted into the dimension table:
WHEN NOT MATCHED BY TARGET THEN
    INSERT (registration, fromDate, toDate, manufacturer,
            model, seatsInBusiness, seatsInEconomy)
    VALUES (registration, fromDate, toDate, manufacturer,
            model, seatsInBusiness, seatsInEconomy);

The aircraft dimension table now contains two records; in this example two versions of the same dimension member.

How to join the fact data

When populating the fact data, you’ll have to keep in mind that SCD 2 dimensions can potentially contain more than one row per natural key (registration number), because there are multiple versions.

So when joining the “aircraft” dimension table to the “rawData” table, the join key will not only include both the aircraft registration number and the date.

INSERT INTO SCD.flight (aircraft_ID, [date],
    flightNo, fromAirport_ID, toAirport_ID,
    paxInBusiness, paxInEconomy, miles)
SELECT ac.[ID] AS aircraft_ID, r.[date], r.flightNo,
    fa.[ID] AS fromAirport, ta.[ID] AS toAirport_ID,
    r.paxInBusiness, r.paxInEconomy, r.miles
FROM SCD.aircraft AS ac
INNER JOIN SCD.rawData AS r ON
    r.registration=ac.registration AND
    ac.fromDate<=r.[date] AND
    (ac.toDate>r.[date] OR
     ac.toDate IS NULL)
INNER JOIN SCD.airport AS fa ON
    fa.airportCode=r.fromAirport
INNER JOIN SCD.airport AS ta ON
    ta.airportCode=r.toAirport;

Note how the last version of a dimension member has a NULL value in the “toDate” column, which means we have to build that into the join to the “aircraft” dimension table.

 

Let me hear your thoughts!

This site uses Akismet to reduce spam. Learn how your comment data is processed.