Using MERGE as a single INSERT or UPDATE

The MERGE statement is a very powerful way to combine INSERT, UPDATE and/or DELETE in a single statement. But there’s more than meets the eye. There are situations where you could use a MERGE statement to perform just one of those tasks.

One of the features of the MERGE statement that I’ve learned to love is the ability to reference columns from the source table that aren’t in the inserted or deleted tables in the OUTPUT clause. Let’s jump straight in, and set up some test data:

--- This is a sample table that we're going to fill with
--- a large number of random integer values:
CREATE TABLE #test (
    [row]  int IDENTITY(1, 1) NOT NULL,
    a      int NOT NULL,
    b      int NOT NULL,
    PRIMARY KEY CLUSTERED ([row])
);

--- One row to start out with:
INSERT INTO #test (a, b) VALUES (1, 2);

--- ... and fill up to about 262 000 rows:
WHILE (@@ROWCOUNT<100000)
   INSERT INTO #test (a, b)
   SELECT 100*RAND(1000*a*RAND()),
          100*RAND(1000*b*RAND())
   FROM #test;

Using INSERT with OUTPUT

Say we want to populate another table, #target, with all distinct values of b from #test. Here’s the table:

CREATE TABLE #target (
    b      int NOT NULL,
    PRIMARY KEY CLUSTERED (b)
);

… and a good way to accomplish the task:

INSERT INTO #target (b)
SELECT DISTINCT b
FROM #test;

Now, let’s imagine we want to use OUTPUT to actually display what we’ve inserted.

INSERT INTO #target (b)
OUTPUT inserted.b
SELECT DISTINCT b
FROM #test;

So far so good. What about other columns? For each distinct b, I want to use OUTPUT to display MIN([row]) and MAX([row]) from the original source table. That would probably look something like this:

--- A common table expression that aggregates
--- the MIN() and MAX() of the row column for
--- each distinct b:
WITH src (b, min_row, max_row)
AS (SELECT b, MIN([row]), MAX([row])
    FROM #test
    GROUP BY b)

--- Using that common table expression, insert
--- a row for each b..
INSERT INTO #target (b)
--- ... and OUTPUT all three columns from the CTE:
OUTPUT inserted.b, t.min_row, t.max_row
SELECT t.b
FROM #src AS t
GROUP BY t.b;

Nope.

All we get is an error message, because the OUTPUT clause cannot reference the min_row and max_row columns in the common table expression. This is because OUTPUT can only use columns that exist in inserted or deleted. inserted and deleted are two virtual system tables that contain all the rows affected by the statement (you’ll find them in triggers as well). inserted contains rows after an INSERT or UPDATE, while deleted contains the original rows, before an UPDATE or DELETE.

Using MERGE

The MERGE statement, however, can actually use columns that aren’t in inserted or deleted. Now, all we have to do is make it work just like an INSERT, right? Here’s how you would build a MERGE statement, by the book:

--- Our source data, in a common table expression:
WITH src (b)
AS (SELECT DISTINCT b
    FROM #test)

--- The MERGE statement, with the target:
MERGE INTO #target AS t
USING src AS s ON s.b=t.b

--- ... and we're only interested in the INSERT bit:
WHEN NOT MATCHED BY TARGET THEN
    INSERT (b) VALUES (s.b);

This query, however, LEFT JOINs the two tables #test and #target, in order to identify which records already exist in #target, but that’s not the same thing as an INSERT. Remember, an INSERT doesn’t really care if the records are already in the table. The solution is to make sure that every row in the MERGE results in a “not matched by target” state, not just the ones that don’t exist in #target. You can do that by changing the join condition:

--- Our source data, in a common table expression:
WITH src (b)
AS (SELECT DISTINCT b
    FROM #test)

--- The MERGE statement, with the target:
MERGE INTO #target AS t
USING src AS s ON 1=0  --- This condition is never true..

--- ... which means this will always happen:
WHEN NOT MATCHED BY TARGET THEN
    INSERT (b) VALUES (s.b);

The “ON 1=0” construct ensures that the join condition is never true, which means that rows from the source CTE will never match a target row, which means that we’ll end up with an INSERT for each of the source rows. It’s technically still a JOIN between two tables if you read the query text, but the query optimizer can tell what’s going on and skips the JOIN entirely, giving you a more efficient plan.

Here are the two queries for comparison; first the one with s.b=t.b, then the other one with 1=0:

MERGE-INSERT example 1

MERGE-INSERT example 2

Note that eliminating the join operator in the diagram has simplified and optimized the query plan. As soon as the volumes add up, this will also reflect in the performance of your query.

But back to our original topic: The MERGE statement allows us to use OUTPUT on columns other than the ones found in the special inserted and deleted tables. In the following example, I’ve used the original INSERT statement (the one that failed) and rebuilt it, so it uses a MERGE statement instead, which allows us to return min_row and max_row:

WITH src (b, min_row, max_row)
AS (SELECT b, MIN([row]), MAX([row])
    FROM #test
    GROUP BY b)

MERGE INTO #target AS t
USING src AS s ON 1=0

WHEN NOT MATCHED BY TARGET THEN
    INSERT (b) VALUES (b)

OUTPUT inserted.b, s.min_row, s.max_row;

For more on the MERGE statement, read this article. If you want to know more about the inserted and deleted tables, read . As usual, I look forward to your feedback, and make sure to tune in next week for another post!

1 comment

Leave a comment

Your email address will not be published. Required fields are marked *