This installment in the series on efficient data is on versioning changes in a table. The article is a re-post of a post I wrote in september on compressing slowly changing dimensions, although the concept does not only apply to dimensions – it can be used pretty much on any data that changes over time.
The idea is to “compress” a versioned table, so instead of just adding a date column for each version, you can compress multiple, sequential versions into a single row with a “from” date and a “to” date. This can significantly compress the size of the table.
The setup
Here’s the sample data we’re going to be working with. It’s an imaginary airline company where all the data is delivered in a raw text file that is dumped into a staging table.
CREATE SCHEMA SCD;
GO
CREATE TABLE SCD.rawData (
[date] date NOT NULL,
flightNo smallint NOT NULL,
registration varchar(10) NOT NULL,
manufacturer varchar(20) NOT NULL,
model varchar(20) NOT NULL,
seatsInBusiness smallint NOT NULL,
seatsInEconomy smallint NOT NULL,
fromAirport char(3) NOT NULL,
toAirport char(3) NOT NULL,
paxInBusiness smallint NOT NULL,
paxInEconomy smallint NOT NULL,
miles smallint NOT NULL,
CONSTRAINT PK_rawData
PRIMARY KEY CLUSTERED ([date], flightNo)
);
--- .. and some sample data:
INSERT INTO SCD.rawData ([date], flightNo, registration, manufacturer, model,
seatsInBusiness, seatsInEconomy, fromAirport, toAirport,
paxInBusiness, paxInEconomy, miles)
VALUES ({d '2013-09-03'}, 201, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'ARN', 'JFK', 14, 250, 3910),
({d '2013-09-03'}, 202, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'JFK', 'ARN', 12, 249, 3910),
({d '2013-09-04'}, 201, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'ARN', 'JFK', 22, 261, 3910),
({d '2013-09-04'}, 202, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'JFK', 'ARN', 19, 252, 3910),
({d '2013-09-05'}, 201, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'ARN', 'JFK', 18, 251, 3910),
({d '2013-09-05'}, 202, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'JFK', 'ARN', 24, 263, 3910),
({d '2013-09-06'}, 201, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'ARN', 'JFK', 31, 261, 3910),
({d '2013-09-06'}, 202, 'XY1234', 'Airbus', 'A330-300 (v1)', 36, 263, 'JFK', 'ARN', 29, 221, 3910),
({d '2013-09-09'}, 201, 'XY1234', 'Airbus', 'A330-300 (v2)', 42, 263, 'ARN', 'JFK', 34, 263, 3910),
({d '2013-09-09'}, 202, 'XY1234', 'Airbus', 'A330-300 (v2)', 42, 263, 'JFK', 'ARN', 38, 258, 3910),
({d '2013-09-10'}, 201, 'XY1234', 'Airbus', 'A330-300 (v2)', 42, 263, 'ARN', 'JFK', 35, 260, 3910),
({d '2013-09-10'}, 202, 'XY1234', 'Airbus', 'A330-300 (v2)', 42, 263, 'JFK', 'ARN', 34, 261, 3910);
In addition to this, we’re going to set up a fact table and two dimensions:
--- Static (SCD 0) table with airports: CREATE TABLE SCD.airport ( [ID] int IDENTITY(101, 1) NOT NULL, airportCode char(3) NOT NULL, displayName varchar(100) NOT NULL, CONSTRAINT PK_airport PRIMARY KEY CLUSTERED ([ID]) ); INSERT INTO SCD.airport (airportCode, displayName) VALUES ('ARN', 'Stockholm Arlanda'), ('JFK', 'New York City JFK'); --- SCD 2 table with aircraft dimension: CREATE TABLE SCD.aircraft ( [ID] int IDENTITY(101, 1) NOT NULL, registration varchar(10) NOT NULL, fromDate date NOT NULL, toDate date NULL, manufacturer varchar(20) NOT NULL, model varchar(20) NOT NULL, seatsInBusiness smallint NOT NULL, seatsInEconomy smallint NOT NULL, CONSTRAINT PK_aircraft PRIMARY KEY CLUSTERED ([ID]) ); --- Fact table with all flights: CREATE TABLE SCD.flight ( aircraft_ID int NOT NULL, [date] date NOT NULL, flightNo smallint NOT NULL, fromAirport_ID int NOT NULL, toAirport_ID int NOT NULL, paxInBusiness smallint NOT NULL, paxInEconomy smallint NOT NULL, miles smallint NOT NULL, CONSTRAINT PK_flight PRIMARY KEY CLUSTERED ([date], flightNo), CONSTRAINT FK_flight_aircraft FOREIGN KEY (aircraft_ID) REFERENCES SCD.aircraft ([ID]), CONSTRAINT FK_flight_fromAirport FOREIGN KEY (fromAirport_ID) REFERENCES SCD.airport ([ID]), CONSTRAINT FK_flight_toAirport FOREIGN KEY (toAirport_ID) REFERENCES SCD.airport ([ID]) );
Ordering the dimension data
The first thing we’re going to do is to order the dimension data by date, using the ROW_NUMBER() windowed function. The dimension’s natural key is the aircraft’s registration number, so this is what we’re going to put in PARTITION BY. The ordering column is the date, so that goes into ORDER BY.
Here’s how it will look:
WITH versions AS ( SELECT registration, [date], manufacturer, model, seatsInBusiness, seatsInEconomy, ROW_NUMBER() OVER ( PARTITION BY registration ORDER BY [date], flightNo) AS ordinal FROM SCD.rawData) SELECT * FROM versions AS a ORDER BY a.registration, a.ordinal;
Note how we get one record per plane and date.
Identifying change records
In order to keep the dimension to a reasonable size, we really want to identify just those records that contain changes. In order to do that, we can use the common table expression above to join itself like this:
FROM versions AS a RIGHT JOIN versions AS b ON a.registration=b.registration AND a.ordinal+1=b.ordinal
… so the “a” table contains the dimension properties as they were before the change, and “b” as they were after, respectively. This is accomplished by joining on “a.ordinal+1=b.ordinal”.
The reson for the RIGHT JOIN is that we don’t just want the changes from one record to the next, but we also want the first version of each member, i.e. where “a” (the previous version) does not exist and b.ordinal=1.
To do this, we’ll apply a WHERE clause that isolates records where b.ordinal=1 or where there’s a change in any of the property columns. Finally, because we’ve filtered out a lot of rows, we’ll need to recalculate the ordinal number, so it’s “dense” once again (see column “denseOrdinal”).
WITH versions AS ( SELECT registration, [date], manufacturer, model, seatsInBusiness, seatsInEconomy, ROW_NUMBER() OVER ( PARTITION BY registration ORDER BY [date], flightNo) AS ordinal FROM SCD.rawData) SELECT b.registration, b.ordinal, b.[date], --- Calculate a new, dense, ordinal number: ROW_NUMBER() OVER ( PARTITION BY b.registration ORDER BY b.ordinal) AS denseOrdinal FROM versions AS a RIGHT JOIN versions AS b ON a.registration=b.registration AND a.ordinal+1=b.ordinal WHERE --- The first row: b.ordinal=1 OR --- ... or rows where something changes --- from one date to the next: a.manufacturer!=b.manufacturer OR a.model!=b.model OR a.seatsInBusiness!=b.seatsInBusiness OR a.seatsInEconomy!=b.seatsInEconomy;
MERGE’ing into a dimension table
All that remains is to merge this data into the dimension table. Here is the complete query:
WITH -- Step 1: Assigning ordinal numbers to the different --- versions of the dimension members versions AS ( SELECT registration, [date], manufacturer, model, seatsInBusiness, seatsInEconomy, ROW_NUMBER() OVER ( PARTITION BY registration ORDER BY [date], flightNo) AS ordinal FROM SCD.rawData), --- Step 2: Filtering out just the change records for --- the dimensions: distinctVersions AS ( SELECT b.registration, b.ordinal, b.[date], --- Calculate a new, dense, ordinal number: ROW_NUMBER() OVER ( PARTITION BY b.registration ORDER BY b.ordinal) AS denseOrdinal FROM versions AS a RIGHT JOIN versions AS b ON a.registration=b.registration AND a.ordinal+1=b.ordinal WHERE --- The first row: b.ordinal=1 OR --- ... or rows where something changes --- from one date to the next: a.manufacturer!=b.manufacturer OR a.model!=b.model OR a.seatsInBusiness!=b.seatsInBusiness OR a.seatsInEconomy!=b.seatsInEconomy), --- Step 3: Join the recordset in "distinctVersions" to "versions" to --- obtain all the properties for each version. In order to --- calculate the "toDate" column, "d2" is also joined. final AS ( SELECT v1.registration, d1.[date] AS fromDate, d2.[date] AS toDate, v1.manufacturer, v1.model, v1.seatsInBusiness, v1.seatsInEconomy FROM distinctVersions AS d1 LEFT JOIN distinctVersions AS d2 ON d1.registration=d2.registration AND d1.denseOrdinal+1=d2.denseOrdinal INNER JOIN versions AS v1 ON d1.registration=v1.registration AND d1.ordinal=v1.ordinal) --- MERGE the resulting data into the dimension table: MERGE INTO SCD.aircraft USING final ON aircraft.registration=final.registration AND aircraft.fromDate=final.fromDate --- Records that don't exist any more (or where fromDate has --- changed, for that matter) WHEN NOT MATCHED BY SOURCE THEN DELETE --- Existing records, to be updated: WHEN MATCHED THEN UPDATE SET aircraft.toDate=final.toDate, aircraft.manufacturer=final.manufacturer, aircraft.model=final.model, aircraft.seatsInBusiness=final.seatsInBusiness, aircraft.seatsInEconomy=final.seatsInEconomy --- New records, to be inserted into the dimension table: WHEN NOT MATCHED BY TARGET THEN INSERT (registration, fromDate, toDate, manufacturer, model, seatsInBusiness, seatsInEconomy) VALUES (registration, fromDate, toDate, manufacturer, model, seatsInBusiness, seatsInEconomy);
The aircraft dimension table now contains two records; in this example two versions of the same dimension member.
How to join the fact data
When populating the fact data, you’ll have to keep in mind that SCD 2 dimensions can potentially contain more than one row per natural key (registration number), because there are multiple versions.
So when joining the “aircraft” dimension table to the “rawData” table, the join key will not only include both the aircraft registration number and the date.
INSERT INTO SCD.flight (aircraft_ID, [date], flightNo, fromAirport_ID, toAirport_ID, paxInBusiness, paxInEconomy, miles) SELECT ac.[ID] AS aircraft_ID, r.[date], r.flightNo, fa.[ID] AS fromAirport, ta.[ID] AS toAirport_ID, r.paxInBusiness, r.paxInEconomy, r.miles FROM SCD.aircraft AS ac INNER JOIN SCD.rawData AS r ON r.registration=ac.registration AND ac.fromDate<=r.[date] AND (ac.toDate>r.[date] OR ac.toDate IS NULL) INNER JOIN SCD.airport AS fa ON fa.airportCode=r.fromAirport INNER JOIN SCD.airport AS ta ON ta.airportCode=r.toAirport;
Note how the last version of a dimension member has a NULL value in the “toDate” column, which means we have to build that into the join to the “aircraft” dimension table.