Late-arriving dimension members

In datawarehousing, a common scenario you may have encountered is the phenomenon of late-arriving dimension members. Imagine a situation where you get dimension members and fact data from two entirely different data sources, which may or may not provide data at the same time. Some fact data will probably be ready to load before its corresponding dimension member(s) arrive. How do you tag your fact data, and what consequence does that have for the integrity of your data warehouse?

Different approaches

One way to deal with late-arriving dimension members is by holding the fact data in a staging area until all the requisite dimension data is loaded. This is a relatively easy solution to build, although by design, it provides a distorted fact table that doesn’t reflect those fact rows that belong to late-arriving members. This can seriously skew the contents of the fact table. Worst case, the dimension members never arrive, and you’ll never see those facts loaded at all. So if you go this way, I would strongly advise you to investigate if your particular business requirements support this.

Another way is to create blank “interim” dimension members, basically a dimension member featuring nothing but the natural key (which you can get from the fact data). Once the actual dimension member arrives, you can update the dimension table with all those properties and details. In this respect, this is a SCD 1 dimension.

A third way to deal with late-arriving members could be to load all the fact transactions, but those rows that don’t match existing dimension members are marked with NULLs or “missing” members. Once the dimension members do arrive, you use some type of business logic to map out which fact rows should be tagged with which respective dimension members, and update those fact rows retroactively. This is a potentially very difficult way to solve this problem, primarily because fact data doesn’t always have a good primary key, and if it does, the volumes involved can be quite large, making this kind of updates rather tricky.

Example of an SCD 1 solution

Let’s take a look at how you could solve this problem using the SCD 1 approach outlined above. In my opinion, this is the least complex and least risky way to deal with late-arriving members.

To start off with, we’re going to need a dimension table, a fact table and some test data:

CREATE SCHEMA DW;
GO

--- This is the dimension table:
CREATE TABLE DW.dimension (
    [ID] int IDENTITY(1, 1) NOT NULL,
    naturalKey varchar(10) NOT NULL,
    [name] varchar(100) NOT NULL,
    isLateArriving bit NOT NULL,
    PRIMARY KEY CLUSTERED ([ID])
);

--- This is the fact table, complete with foreign key constraint:
CREATE TABLE DW.fact (
    [date] date NOT NULL,
    _dimension_ID int NOT NULL,
    amount numeric(10, 2) NOT NULL,
    PRIMARY KEY CLUSTERED ([date], _dimension_ID),
    FOREIGN KEY (_dimension_ID) REFERENCES DW.dimension ([ID])
);

--- Staging data for dimension members:
CREATE TABLE #sourceMembers (
    naturalKey varchar(10) NOT NULL,
    [name] varchar(100) NOT NULL,
    PRIMARY KEY CLUSTERED (naturalKey)
);

--- ... except, members 003 and 004 haven't arrived yet:
INSERT INTO #sourceMembers (naturalKey, [name])
VALUES ('001', 'First member'),
       ('002', 'Second member'),
/*     ('003', 'Third member'),
       ('004', 'Fourth member'), */
       ('005', 'Fifth member'),
       ('006', 'Sixth member');

--- Staging data for facts:
CREATE TABLE #sourceData (
    [date] date NOT NULL,
    naturalKey varchar(10) NOT NULL,
    amount numeric(10, 2) NOT NULL,
    PRIMARY KEY CLUSTERED ([date], naturalKey)
);

INSERT INTO #sourceData ([date], naturalKey, amount)
VALUES ({d '2013-12-31'}, '001', 1000.00),
       ({d '2014-01-31'}, '001', 1000.00),
       ({d '2014-02-28'}, '002', 1000.00),
       ({d '2014-03-31'}, '002', 1000.00),
       ({d '2014-04-30'}, '003', 1000.00),
       ({d '2014-05-31'}, '003', 1000.00),
       ({d '2014-06-30'}, '004', 1000.00),
       ({d '2014-07-31'}, '004', 1000.00),
       ({d '2014-08-31'}, '005', 1000.00);

--- This is how we load the dimension table:
MERGE INTO DW.dimension AS dim
USING #sourceMembers AS src ON
    dim.naturalKey=src.naturalKey

WHEN NOT MATCHED BY TARGET THEN
    INSERT (naturalKey, [name], isLateArriving)
    VALUES (src.naturalKey, src.[name], 0)

WHEN MATCHED THEN
    UPDATE SET dim.[name]=src.[name], dim.isLateArriving=0;

In the data above, the dimension members 003 and 004 don’t exist in the staging table for the dimension data, #sourceMembers, but they’re referenced in the staging table for the fact data, #sourceData. If we were to load this data directly into the dimension and fact tables, we would get one of two results. If we LEFT JOIN the dimension table, the INSERT operation into DW.fact will fail because some of the rows will yield a NULL value in _dimension_ID. If we INNER JOIN the dimension table, the fact data won’t include those rows.

The solution, then, is to create those two missing dimension members first, so we have a complete dimension table before we start loading fact data. Once we’ve done that, the fact load can be run. Here’s the entire batch, including a BEGIN/COMMIT TRANSACTION.

BEGIN TRANSACTION;

    --- Populating the dimension table with placeholders for
    --- late-arriving members
    INSERT INTO DW.dimension (naturalKey, [name], isLateArriving)
    SELECT DISTINCT fact.naturalKey, '(Late-arriving member)', 1
    FROM #sourceData AS fact
    WHERE fact.naturalKey NOT IN (
        SELECT naturalKey FROM DW.dimension);

    --- Populating fact table
    INSERT INTO DW.fact ([date], _dimension_ID, amount)
    SELECT fact.[date], dim.[ID] AS _dimension_ID, fact.amount
    FROM #sourceData AS fact
    INNER JOIN DW.dimension AS dim ON fact.naturalKey=dim.naturalKey;

COMMIT TRANSACTION;

The reason for the SQL transaction in this batch is that we need to protect ourselves from a scenario where data is added to #sourceData while we’re populating DW.dimension (or, technically, directly afterwards). This would result in fact rows with potential late-arriving members, where we wouldn’t have created those late-arriving members in the dimension table once we start loading facts. The SQL transaction locks the fact stage table when we create the late-arriving dimension members.

Further down the road, once the late members arrive, the standard dimension loading code will take care of updating the member rows in the dimension table:

--- This is how we load the dimension table:
MERGE INTO DW.dimension AS dim
USING #sourceMembers AS src ON
    dim.naturalKey=src.naturalKey

WHEN NOT MATCHED BY TARGET THEN
    INSERT (naturalKey, [name], isLateArriving)
    VALUES (src.naturalKey, src.[name], 0)

WHEN MATCHED THEN
    UPDATE SET dim.[name]=src.[name], dim.isLateArriving=0;

With this code, any changes in the staging table for the dimension data will be reflected in the dimension table, which by definition is SCD 1. If you only want the initial properties of the dimension members to be reflected, you can add another filter to the MERGE statement:

...
WHEN MATCHED AND dim.isLateArriving=1 THEN
    UPDATE SET dim.[name]=src.[name], dim.isLateArriving=0;

Let me hear your thoughts!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s