Cool MERGE features you may not know about

The MERGE statement is a really powerful way to create what’s called “upserts”. In this article, I’ll take a more detailed look at how you can make the best use of MERGE and I’ll also show you some cool tricks I picked up along the way.

The terminology

For the purpose of this article, I’m making a difference between join conditions and search conditions. The join condition specifies how you join the source and target data together, whereas the search condition is one of the following three, used in the MERGE syntax:

  • WHEN NOT MATCHED BY SOURCE – a row exists in the target table, but not in the source
  • WHEN MATCHED – a matching row exists in both data sets
  • WHEN NOT MATCHED BY TARGET – a row exists in the source, but not in the target table

Join conditions are important!

When you’re performing a MERGE, which set of search conditions you use determines how the two data sets are joined, which in turn has performance implications. If you use only “when matched” conditions, you’re actually performing an INNER JOIN. If you use one of the two “not matched by” conditions, you’re performing a LEFT JOIN of sorts (actually, an anti-semi join), and finally, if you use both “not matched by” conditions, the resulting operating is equivalent to a FULL JOIN.

The reason I’m bringing this up is because in an INNER JOIN, it doesn’t really matter if you mix all your JOIN and WHERE arguments in the JOIN operator, but in outer joins, you’ll have to place the WHERE arguments in a common table expression or in the search condition to make the query work.

Adding extra conditions for performance

When working with fairly large result sets, particularly with incremental updates, most of the matched rows will probably already be in sync. You can actually help SQL Server skip these rows by adding an extra condition to the “when matched” search condition (using AND). I’ve found that this can provide a huge performance boost, primarily because you vastly limit the amount of rows that are updated (which saves on CPU and I/O load), but possibly also because a smaller dataset can result in a more optimal execution plan.

In this example, the matched rows (2 and 3) are already identical, so we really don’t need to update them.

--- Example source table:
DECLARE @source TABLE (
    pk       int NOT NULL,
    x        int NULL,
    y        int NULL,
    PRIMARY KEY CLUSTERED (pk)
);

INSERT INTO @source (pk, x, y)
VALUES  (1, 100, 101),
        (2, 200, 201),
        (3, 300, 301);

--- Example target table:
DECLARE @target TABLE (
    pk        int NOT NULL,
    x        int NULL,
    y        int NULL,
    PRIMARY KEY CLUSTERED (pk)
);

INSERT INTO @target (pk, x, y)
VALUES (2, 200, 201),
       (3, 300, 301),
       (4, 400, 401);

SELECT * FROM @source;

--- The MERGE statement:
MERGE INTO @target AS tgt
USING @source AS src
    --- ... the JOIN condition:
    ON tgt.pk=src.pk

WHEN NOT MATCHED BY SOURCE THEN
    DELETE

--- When matched,
--- but only if something needs to be updated!
WHEN MATCHED AND (tgt.x!=src.x OR tgt.y!=src.y) THEN
    UPDATE
    SET tgt.x=src.x,
        tgt.y=src.y

WHEN NOT MATCHED BY TARGET THEN
    INSERT (pk, x, y)
    VALUES (pk, x, y);

--- The resulting target table:
SELECT * FROM @target;

Limiting the scope of MERGE operations

If your source dataset only contains a subset (like a partition or batch) of the target data, a regular MERGE operation won’t work for you. Take a look at this example, where we load data in “batches”. The target table already contains data from batches 1, 2 and 3 (from previous loads), but the source table contains only data for batch 3.

--- Example source table:
DECLARE @source TABLE (
    pk       int NOT NULL,
    x        int NULL,
    y        int NULL,
    PRIMARY KEY CLUSTERED (pk)
);

INSERT INTO @source (pk, x, y)
VALUES (9, 300, 301),
      (10, 400, 401),
      (11, 500, 501);

--- Example target table:
DECLARE @target TABLE (
    pk        int NOT NULL,
    batch     int NOT NULL,
    x         int NULL,
    y         int NULL,
    PRIMARY KEY CLUSTERED (pk)
);

INSERT INTO @target (pk, batch, x, y)
VALUES (2, 1, 200, 201),
       (3, 1, 300, 301),
       (4, 1, 400, 401),
       (5, 2, 200, 201),
       (6, 2, 300, 301),
       (7, 2, 400, 401),
       (8, 3, 200, 201),
       (9, 3, 300, 301);

SELECT * FROM @source;

DECLARE @batch int=3;

--- The MERGE statement:
MERGE INTO @target AS tgt
USING @source AS src
    --- ... the JOIN condition:
    ON tgt.pk=src.pk

WHEN NOT MATCHED BY SOURCE THEN
    DELETE

--- When matched:
WHEN MATCHED AND (tgt.x!=src.x OR tgt.y!=src.y) THEN
    UPDATE
    SET tgt.x=src.x,
        tgt.y=src.y

WHEN NOT MATCHED BY TARGET THEN
    INSERT (pk, batch, x, y)
    VALUES (pk, @batch, x, y);

--- The resulting target table:
SELECT * FROM @target;

You’ll notice that this MERGE statement will delete everything from batches 1 and 2 in the target table, because they aren’t matched by the source table (containing only batch 3).

The solution is to modify the “when not matched by source” search condition, so we also check for the correct batch number:

--- When not matched by source
--- ... and from the same batch!
WHEN NOT MATCHED BY SOURCE AND tgt.batch=@batch THEN
    DELETE

Like the previous paragraph, you can make good use of this technique to optimize performance, even if both the source and target tables have all the data. As long as you know that only a single batch/partition of your data could have been modified, you only need to match and check that specific batch, and not the entire table.

Bonus: More than one WHEN MATCHED condition

The “when matched” search condition can appear twice! If this is the case, the first of the two search conditions has to use an “AND” construct (as discussed in the previous paragraph). Also, one of the two conditions must be an UPDATE and the other one a DELETE. These rules may seem a bit arbitrary, but they make sense when you think about them.

Just like with CASE, the two conditions don’t neccessarily need to be mutually exclusive, because they are evaluated in the order they are stated in the code.

Here’s an example, where we have a “deleteMe” column in the source dataset. If this column is 1, the matching target row is to be deleted, otherwise we want to update it.

--- Example source table, with the deleteMe column:
DECLARE @source TABLE (
    pk       int NOT NULL,
    x        int NULL,
    y        int NULL,
    deleteMe bit NOT NULL,
    PRIMARY KEY CLUSTERED (pk)
);

INSERT INTO @source (pk, x, y, deleteMe)
VALUES  (1, 100, 101, 0),
        (2, 200, 201, 1),
        (3, 300, 301, 0);

--- Example target table:
DECLARE @target TABLE (
    pk       int NOT NULL,
    x        int NULL,
    y        int NULL,
    PRIMARY KEY CLUSTERED (pk)
);

INSERT INTO @target (pk, x, y)
VALUES (2, 200, 201),
       (3, 300, 301),
       (4, 400, 401);

SELECT * FROM @source;

--- The MERGE statement:
MERGE INTO @target AS tgt
USING @source AS src
    --- ... the JOIN condition:
    ON tgt.pk=src.pk

WHEN NOT MATCHED BY SOURCE THEN
    DELETE

--- The first WHEN MATCHED condition, with an AND
WHEN MATCHED AND src.deleteMe=1 THEN
    DELETE

--- The second WHEN MATCHED catches the remaining matched rows
WHEN MATCHED THEN
    UPDATE
    SET tgt.x=src.x,
        tgt.y=src.y

WHEN NOT MATCHED BY TARGET THEN
    INSERT (pk, x, y)
    VALUES (pk, x, y);

--- The resulting target table:
SELECT * FROM @target;

Using MERGE with foreign key constraints

If you’re MERGE’ing into a table which, in turn, is referenced by a foreign key constraint, there is another performance pitfall to watch out for. Whenever you delete rows from a primary key table that is referenced by foreign key constraints, SQL Server has to “assert” (i.e. verify) that the delete operation does not create any orphans in other tables. In my experience, this is true even if no rows at all are actually deleted. This assert operation takes a lot of processing power, which can potentially really make MERGE operations against primary key tables slow.

Here’s an example with one primary key table and three foreign key tables that refer to it using foreign key constraints:

--- This is our test primary key table:
CREATE TABLE dbo.pk (
    pk       int NOT NULL,
    PRIMARY KEY CLUSTERED (pk));

INSERT INTO dbo.pk (pk) VALUES (1), (2), (3);

--- Foreign key table 1:
CREATE TABLE dbo.fk1 (
    pk       int IDENTITY(1, 1) NOT NULL,
    _fk_ID   int NOT NULL,
    PRIMARY KEY CLUSTERED (pk),
    FOREIGN KEY (_fk_ID) REFERENCES dbo.pk (pk));

INSERT INTO dbo.fk1 (_fk_ID) VALUES (1), (2);

--- Foreign key table 2:
CREATE TABLE dbo.fk2 (
    pk       int IDENTITY(1, 1) NOT NULL,
    _fk_ID   int NOT NULL,
    PRIMARY KEY CLUSTERED (pk),
    FOREIGN KEY (_fk_ID) REFERENCES dbo.pk (pk));

INSERT INTO dbo.fk2 (_fk_ID) VALUES (2), (2);

--- Foreign key table 3:
CREATE TABLE dbo.fk3 (
    pk       int IDENTITY(1, 1) NOT NULL,
    _fk_ID   int NOT NULL,
    PRIMARY KEY CLUSTERED (pk),
    FOREIGN KEY (_fk_ID) REFERENCES dbo.pk (pk));

INSERT INTO dbo.fk1 (_fk_ID) VALUES (2), (3);

--- ... and our source table for the MERGE operation:
CREATE TABLE #source (
    pk   int NOT NULL,
    PRIMARY KEY CLUSTERED (pk));

INSERT INTO #source (pk) VALUES (0), (1), (2), (3), (4);

Now, take a look at the MERGE statement using “estimated query plan” (highlight it in Management Studio and press Ctrl+L):

--- MERGE with DELETE and INSERT operators:
MERGE INTO dbo.pk AS tgt
USING #source AS src ON src.pk=tgt.pk

WHEN NOT MATCHED BY SOURCE THEN DELETE

WHEN NOT MATCHED BY TARGET THEN INSERT (pk) VALUES (pk);

Notice how SQL Server joins the PK table to all three of its foreign key tables. This is a really heavy operation if one or several of the FK tables contain a lot of data. In fact, merging 10 rows into the PK table could take you all day if the table is large enough.

There are a few solutions to this problem.

  • Removing the foreign key constraints. Not really recommended, though, because foreign keys are there for a reason, after all.
  • Removing the DELETE from the MERGE statement, if you’re sure it won’t be used.
  • Removing the DELETE from the MERGE and performing it separately.

You could argue that a row can remain in the PK table even if it’s (no longer) used in the FK tables. This is a relatively easy way out, because it completely eliminates the need for a DELETE operation. Compare the following two query plans:

--- MERGE with DELETE and INSERT operators:
MERGE INTO dbo.pk AS tgt
USING #source AS src ON src.pk=tgt.pk

WHEN NOT MATCHED BY SOURCE THEN DELETE

WHEN NOT MATCHED BY TARGET THEN INSERT (pk) VALUES (pk);

--- MERGE with only the INSERT operator:
MERGE INTO dbo.pk AS tgt
USING #source AS src ON src.pk=tgt.pk

WHEN NOT MATCHED BY TARGET THEN INSERT (pk) VALUES (pk);

Though, a lot of times, you may actually have to remove the PK row. If this is something that happens very rarely, you can manually check if there is anything to remove, and only then remove those rows using an ordinary DELETE statement:

--- MERGE with only the INSERT operator...
MERGE INTO dbo.pk AS tgt
USING #source AS src ON src.pk=tgt.pk

WHEN NOT MATCHED BY TARGET THEN INSERT (pk) VALUES (pk);

--- ... and a separate DELETE query:
IF (EXISTS (SELECT pk FROM dbo.pk
        WHERE pk NOT IN (SELECT pk FROM #source)))
    DELETE FROM dbo.pk
    WHERE pk NOT IN (SELECT pk FROM #source)

Not a pretty solution, I know, but one that may boost your query performance on a day-to-day basis.

Bonus: using OUTPUT with source table columns

I found out by accident that for the MERGE operator the OUTPUT clause can also refer to columns in the source table, not just the inserted and deleted tables. This, to my knowledge, is unique for the MERGE operator – it won’t work for other DML operators like INSERT or UPDATE.

This can be really useful if you want to output the result of a MERGE operation into, say, a logging table or similar. A lot of times, you want to add data to this table that isn’t neccessarily in the merged dataset, and isn’t a constant.

For example, let’s look at the example with the deleteMe column that we used previously.

--- Example source table, with the deleteMe column:
DECLARE @source TABLE (
    pk       int NOT NULL,
    x        int NULL,
    y        int NULL,
    deleteMe bit NOT NULL,
    PRIMARY KEY CLUSTERED (pk)
);

INSERT INTO @source (pk, x, y, deleteMe)
VALUES  (1, 100, 101, 0),
        (2, 200, 201, 1),
        (3, 300, 301, 0);

--- Example target table:
DECLARE @target TABLE (
    pk       int NOT NULL,
    x        int NULL,
    y        int NULL,
    PRIMARY KEY CLUSTERED (pk)
);

INSERT INTO @target (pk, x, y)
VALUES (2, 200, 201),
       (3, 300, 301),
       (4, 400, 401);

--- The MERGE statement:
MERGE INTO @target AS tgt
USING @source AS src
    --- ... the JOIN condition:
    ON tgt.pk=src.pk

WHEN NOT MATCHED BY SOURCE THEN
    DELETE

WHEN MATCHED AND src.deleteMe=1 THEN
    DELETE

WHEN MATCHED THEN
    UPDATE
    SET tgt.x=src.x,
        tgt.y=src.y

WHEN NOT MATCHED BY TARGET THEN
    INSERT (pk, x, y)
    VALUES (pk, x, y)

--- OUTPUT clause, containing columns from the
--- "inserted" and "deleted" special tables, as
--- well as a column from the source table!
OUTPUT inserted.*, deleted.*, src.deleteMe;

Notice how the output data from inserted and deleted does not contain the “deleteMe” column, because this column does not exist in the target table. Instead, we can explicitly include it directly from the source table.

10 thoughts on “Cool MERGE features you may not know about

  1. Pingback: A Code Sample with Merge | Jingyang Li

Let me hear your thoughts!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s