This is a real-world problem that I came across the other day. In a reporting scenario, I wanted to output a number of values in an easy, human-readable way for a report. But just making a long, comma-separated string of numbers doesn’t really make it very readable. This is particularly true when there are hundreds of values.
So here’s a powerful pattern to solve that task.
Some test data
Here’s some test data in a temp table to play around with.
CREATE TABLE #integers ( i int NOT NULL, PRIMARY KEY CLUSTERED (i) ); INSERT INTO #integers VALUES (1), (2), (3), (4), (5), (8), (9), (10), (11), (12), (15), (18), (20), (21), (22), (23), (25);
Note the gaps between values (5 and 8), (12 and 15), (18 and 20) as well as (23 and 25). Ideally, I’d like to present this data like this:
1-5, 8-12, 15, 18, 20-23, 25
Finding gaps with windowed functions
A great trick that I learned from Itzik Ben-Gan is using windowed functions to identify gaps and islands in a set of values. That comes in really handy here.
Let’s start simple. The LAG() function identifies the integer value on the previous row, relative to each row (as defined by its OVER clause)
SELECT i, LAG(i, 1) OVER (ORDER BY i) AS previous_i FROM #integers;
Which gives us
i previous_i ---- ---------- 1 NULL 2 1 3 2 4 3 5 4 8 5 9 8 ...
Now, in a contiguous series of integers, the previous value of (i) will always be (i-1), right? Let’s test for that and set a bit value to 1 when there’s a gap and 0 when the series is intact.
SELECT i, (CASE WHEN LAG(i, 1) OVER (ORDER BY i)=i-1 THEN 0 ELSE 1 END) AS is_new_series FROM #integers;
Here’s the output:
i is_new_series ---- ------------- 1 1 2 0 3 0 4 0 5 0 8 1 9 0
Now, let’s make a running total of the “is_new_series” column. This running total will in effect become a “group number”.
SELECT i, SUM(is_new_series) OVER ( ORDER BY i ROWS UNBOUNDED PRECEDING) AS group_number FROM ( ... ) AS sub;
See where this is going?
i group_number ---- ------------ 1 1 2 1 3 1 4 1 5 1 8 2 9 2 10 2 11 2 12 2 15 3
Now, we can aggregate this dataset using the “group_number” column. Each group (or “island”) contains a contigous series of integers – so we can just get MIN() and MAX() values for each group:
SELECT MIN(i) AS min_i, MAX(i) AS max_i FROM ( ... ) AS sub GROUP BY group_number;
Now the data set is aggregated like this:
min_i max_i -------- -------- 1 5 8 12 15 15 18 18 20 23 25 25
Now, just add string parsing: First, prefix a comma and a space. Then, depending on if the range spans just a single member or not, we can choose to present it as “a-b” or just “a” (rather than the ugly “a-a”).
SELECT ', '+CAST(min_i AS varchar(10))+ ISNULL('-'+CAST(NULLIF(max_i, min_i) AS varchar(10)), '') FROM ( ... ) AS sub;
This returns human-readable ranges, one range for every row:
, 1-5 , 8-12 , 15 , 18 , 20-23 , 25
Getting all those rows on a single line
Finally, we’re going to wrap all this with a neat pattern using SQL Server’s built-in XML parsing to put all these ranges on a single, comma-separated line.
SELECT SUBSTRING(CAST(( ... ORDER BY min_i FOR XML PATH(''), TYPE ) AS varchar(max)), 3, 10000);
You can read more about this pattern in my post on a LISTAGG equivalent for SQL Server. Here’s the output:
1-5, 8-12, 15, 18, 20-23, 25
The complete code
Remember: read this from the inside out.
--- Turning XML into a string: SELECT SUBSTRING(CAST(( --- Turning rows into XML: SELECT ', '+CAST(min_i AS varchar(10))+ ISNULL('-'+CAST(NULLIF(max_i, min_i) AS varchar(10)), '') FROM ( --- MIN() and MAX() for each series SELECT MIN(i) AS min_i, MAX(i) AS max_i FROM ( --- Running total of "new series" bit: SELECT i, SUM(is_new_series) OVER ( ORDER BY i ROWS UNBOUNDED PRECEDING) AS group_number FROM ( --- Compare previous i to current i: SELECT i, (CASE WHEN LAG(i, 1) OVER (ORDER BY i)=i-1 THEN 0 ELSE 1 END) AS is_new_series FROM #integers ) AS sub ) AS sub GROUP BY group_number ) AS sub ORDER BY min_i FOR XML PATH(''), TYPE ) AS varchar(max)), 3, 10000);
The same thing with dates
You can use this pattern for any data type with discrete values; integers, dates, even characters in theory. Here’s the exact same code for date ranges. The only difference is that instead of “-1”, we use DATEADD() to subtract one day.
--- Turning XML into a string: SELECT SUBSTRING(CAST(( --- Turning rows into XML: SELECT ', '+CONVERT(varchar(10), min_date, 121)+ ISNULL(' to '+CONVERT(varchar(10), NULLIF(max_date, min_date), 121), '') FROM ( --- MIN() and MAX() for each series SELECT MIN([date]) AS min_date, MAX([date]) AS max_date FROM ( --- Running total of "new series" bit: SELECT [date], SUM(is_new_series) OVER ( ORDER BY [date] ROWS UNBOUNDED PRECEDING) AS group_number FROM ( --- Compare previous i to current i: SELECT [date], (CASE WHEN LAG([date], 1) OVER (ORDER BY [date])=DATEADD(day, -1, [date]) THEN 0 ELSE 1 END) AS is_new_series FROM #dates ) AS sub ) AS sub GROUP BY group_number ) AS sub ORDER BY min_date FOR XML PATH(''), TYPE ) AS varchar(max)), 3, 10000);
What about performance?
Right out of the box, the plan contains two Sort operators that I would want to eliminate to make it perform its best. The reason sorts are expensive is because they need a memory grant and they’re what’s known as “blocking”, which means they’re actually holding up the data stream.
You can clearly see the two sorts in red markings:
There are ways to tune this query, and they mainly involve eliminating the Sort operators. The first Sort is a result of the GROUP BY, the second Sort actually just reverts the stream to its original sort order.
But I’m sure this will be a great topic for another post. Stay tuned!