You will often encounter situations where you have to reconcile two sets of data – are they the same, and if not, what differences are there? This can be because you re-wrote a piece of code and you want to verify that it still does the same job, or it can be a straight-forward comparison of two data sources. In this article, we’ll look through some useful methods and functions to compare data in SQL Server.

CHECKSUM()

The CHECKSUM() function is a quick and easy way to get a first glimpse of your data. CHECKSUM() creates an integer checksum over a single row of data, over multiple columns.

CHECKSUM_AGG()

The CHECKSUM_AGG() function calculates checksums over a single column, over multiple rows. This makes it ideal to use together with the CHECKSUM() function to quickly create a checksum for an entire table.

SELECT CHECKSUM_AGG(CHECKSUM(*))
FROM dbo.myTable

Although you don’t have to use CHECKSUM(*) as your argument for CHECKSUM_AGG(), bear in mind that the CHECKSUM_AGG() function only accepts integer values as its input argument. As long as you stick to integer values, your safe.

Also, here’s the thing with checksums. They’re just checksums, and not unique representations of the underlying data; Two identical rows will always get the same checksum, but you will sometimes stumble over two different rows that get the same checksum. If that wasn’t the case, a checksum would be the perfect compression algorithm, right?

The FULL JOIN

My personal favourite when checking differences between two data sets is using a FULL JOIN to map out any differences. The basic setup of this is really simple. Here are two example tables, #before and #after, that we want to compare:

CREATE TABLE #before (
    key_a        int NOT NULL,
    key_b        int NOT NULL,
    value_c        numeric(12, 2) NULL,
    value_d        numeric(12, 2) NULL,
    PRIMARY KEY CLUSTERED (key_a, key_b)
)

INSERT INTO #before
SELECT 1, 2, 1234.56, NULL UNION ALL
SELECT 1, 3, 1235.56, NULL UNION ALL       --- Not the same in value_d
SELECT 1, 4, 1236.56, 1001.00 UNION ALL
SELECT 2, 3, 1237.56, 1002.00 UNION ALL    --- Not the same in value_c
SELECT 2, 4, 1238.56, 1003.00 UNION ALL    --- Not in #after
SELECT 2, 5, 1239.56, 1004.00

CREATE TABLE #after (
    key_a        int NOT NULL,
    key_b        int NOT NULL,
    value_c        numeric(12, 2) NULL,
    value_d        numeric(12, 2) NULL,
    PRIMARY KEY CLUSTERED (key_a, key_b)
)

INSERT INTO #after
SELECT 1, 1, 1234.56, NULL UNION ALL       --- Not in #before
SELECT 1, 2, 1234.56, NULL UNION ALL
SELECT 1, 3, 1235.56, NULL UNION ALL       --- Not the same in value_d
SELECT 1, 4, 1236.56, 1001.00 UNION ALL
SELECT 2, 3, 1240.00, 1002.00 UNION ALL    --- Not the same in value_c
SELECT 2, 5, 1239.56, 1004.00

The minimum requisite of any diff comparison between two tables like these is that they have the same primary key. In the example above, we’ve defined this primary key on the actual table, but you might as well use a GROUP BY clause in a subquery or CTE to make them unique. The reason for this, of course, is that we’re going to join them, and we don’t want any cartesian duplicates.

Here’s how the FULL JOIN comparison looks:

SELECT *
FROM #before AS x
FULL JOIN #after AS y ON
    x.key_a=y.key_a AND
    x.key_b=y.key_b

As you can see when you run the above query, you’ll see both sets side-by-side, but it may be hard for the untrained eye to catch the differences. So we’ll add columns to the query to help us get a better view of things:

SELECT ISNULL(x.key_a, y.key_a) AS key_a,
       ISNULL(x.key_b, y.key_b) AS key_b,
       x.value_c AS c_before, y.value_c AS c_after,
       x.value_d AS d_before, y.value_d AS d_after
FROM #before AS x
FULL JOIN #after AS y ON
    x.key_a=y.key_a AND
    x.key_b=y.key_b

Already clearer, now let’s add calculated difference columns:

SELECT ISNULL(x.key_a, y.key_a) AS key_a,
       ISNULL(x.key_b, y.key_b) AS key_b,
       x.value_c AS c_before, y.value_c AS c_after,
       ISNULL(x.value_c, 0.0)-ISNULL(y.value_c, 0.0) AS c_diff,
       x.value_d AS d_before, y.value_d AS d_after,
       ISNULL(x.value_d, 0.0)-ISNULL(y.value_d, 0.0) AS d_diff
FROM #before AS x
FULL JOIN #after AS y ON
    x.key_a=y.key_a AND
    x.key_b=y.key_b

The ISNULL() construct is used because we don’t know at design-time if one or both of the values is going to be NULL. You may recall that under normal ANSI flags, using an arithmetic operator with one or both values NULL will yield a NULL value, and that wouldn’t be any use to us here.

Finally, we can add a text field to show the user what’s going on.

SELECT ISNULL(x.key_a, y.key_a) AS key_a,
       ISNULL(x.key_b, y.key_b) AS key_b,
       x.value_c AS c_before, y.value_c AS c_after,
       ISNULL(x.value_c, 0.0)-ISNULL(y.value_c, 0.0) AS c_diff,
       x.value_d AS d_before, y.value_d AS d_after,
       ISNULL(x.value_d, 0.0)-ISNULL(y.value_d, 0.0) AS d_diff,
       (CASE WHEN x.key_a IS NULL THEN 'Added'
             WHEN y.key_a IS NULL THEN 'Deleted'
         WHEN ISNULL(x.value_c, 0.0)!=ISNULL(y.value_c, 0.0) THEN 'c'
         WHEN ISNULL(x.value_d, 0.0)!=ISNULL(y.value_d, 0.0) THEN 'd'
         END)
FROM #before AS x
FULL JOIN #after AS y ON
    x.key_a=y.key_a AND
    x.key_b=y.key_b

You will note in the CASE block that we test if a key column is NULL or not. If #before.key_a is NULL, that means that there is no corresponding record in #before to match the one in #after. This comparison has to be done on a non-nullable column, or it won’t return the correct results.

Finally, you can add the diff columns in the SQL above to the WHERE clause instead, that way, you’ll only see records where there is a difference.

Obviously, the exact syntax of your reconciliation query will depend on your specific requirements. Reconciliation jobs can be infinitely complex, so this is just a primer to get you started. Leave a comment if there’s anything you think I’ve left out.

Basic reconciliation skills

CHECKSUM()

CHECKSUM_AGG()

The FULL JOIN

1 comment

Leave a comment Cancel reply