Here’s an interesting feature I found in the code of a colleague the other day. A common task in T-SQL is eliminating duplicate records. This became a lot easier with the introduction of windowed functions way back in SQL Server 2005, such as ROW_NUMBER(), but it turns out, I’ve still been missing out on a really simple and cool solution.
The setup
Here’s a test table with a bunch of rows, some of which are duplicates.
CREATE TABLE #table ( a int NOT NULL, b int NOT NULL, c int NOT NULL, CONSTRAINT pk PRIMARY KEY CLUSTERED (a, b) ); INSERT INTO #table (a, b, c) VALUES (1, 2, 3), (1, 3, 4), (3, 4, 5), (3, 5, 6), (5, 6, 7), (5, 7, 8), (6, 8, 9);
Eliminating duplicates
Our task is to eliminate these duplicates, so the “a” column becomes unique in the table. Here’s how I would have solved this task until recently:
WITH dups (a, b, duplicate) AS (--- The a and b columns are used to join the CTE to the base table, --- the "duplicate" column is the row_number column used to identify --- all of the duplicate records in the table: SELECT a, b, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS duplicate FROM #table) --- Join the CTE to #table and delete all duplicate rows (where --- "duplicate" is greater than 1). DELETE tbl FROM dups INNER JOIN #table AS tbl ON dups.a=tbl.a AND dups.b=tbl.b WHERE dups.duplicate>1;
Turns out, though, you don’t really need the cumbersome self-join with the common table expression. You can simply use the DELETE statement directly on the CTE or the equivalent subquery.
--- Delete directly from a subquery: DELETE dupl FROM ( SELECT ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS duplicate FROM #table) AS dupl WHERE duplicate>1; --- ... or the same operation, using a CTE: WITH dups (duplicate) AS ( SELECT ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS duplicate FROM #table) DELETE FROM dups WHERE duplicate>1;
Notice how the CTE or subquery doesn’t even need unique columns in #table! Because my original solution joins a subquery with #table, we need a set of unique columns in #table (“a” and “b” in this example). Lacking this, we’d be forced to create an IDENTITY() column or similar. Using this much more elegant solution, you won’t have to write any join, so you don’t need a unique set of columns!
Comparing the two in a query plan confirms that eliminating the join is actually beneficial to the performance of the query, and also that there’s no difference between the CTE approach and the subquery, except for readability (which comes down to your personal preference).
Obviously, the performance difference won’t really be a relevant issue until you scale up your data volumes considerably from this example.
Credits
Thanks to Michael Berndtsson for the eye-opener! This is T-SQL art.