Human-readable ranges of integers or dates

This is a real-world problem that I came across the other day. In a reporting scenario, I wanted to output a number of values in an easy, human-readable way for a report. But just making a long, comma-separated string of numbers doesn’t really make it very readable. This is particularly true when there are hundreds of values.

So here’s a powerful pattern to solve that task.

Some test data

Here’s some test data in a temp table to play around with.

CREATE TABLE #integers (
    i        int NOT NULL,
    PRIMARY KEY CLUSTERED (i)
);

INSERT INTO #integers
VALUES (1), (2), (3), (4), (5), (8), (9), (10), (11),
       (12), (15), (18), (20), (21), (22), (23), (25);

Note the gaps between values (5 and 8), (12 and 15), (18 and 20) as well as (23 and 25). Ideally, I’d like to present this data like this:

1-5, 8-12, 15, 18, 20-23, 25

Finding gaps with windowed functions

A great trick that I learned from Itzik Ben-Gan is using windowed functions to identify gaps and islands in a set of values. That comes in really handy here.

Let’s start simple. The LAG() function identifies the integer value on the previous row, relative to each row (as defined by its OVER clause)

SELECT i, LAG(i, 1) OVER (ORDER BY i) AS previous_i
FROM #integers;

Which gives us

i    previous_i
---- ----------
1    NULL
2       1
3       2
4       3
5       4
8       5
9       8
...

Now, in a contiguous series of integers, the previous value of (i) will always be (i-1), right? Let’s test for that and set a bit value to 1 when there’s a gap and 0 when the series is intact.

SELECT i, (CASE WHEN LAG(i, 1) OVER (ORDER BY i)=i-1
           THEN 0
           ELSE 1 END) AS is_new_series
FROM #integers;

Here’s the output:

i    is_new_series
---- -------------
1    1
2    0
3    0
4    0
5    0
8    1
9    0

Now, let’s make a running total of the “is_new_series” column. This running total will in effect become a “group number”.

SELECT i, SUM(is_new_series) OVER (
       ORDER BY i ROWS UNBOUNDED PRECEDING) AS group_number
FROM (
    ...
    ) AS sub;

See where this is going?

i    group_number
---- ------------
1     1
2     1
3     1
4     1
5     1
8     2
9     2
10    2
11    2
12    2
15    3

Now, we can aggregate this dataset using the “group_number” column. Each group (or “island”) contains a contigous series of integers – so we can just get MIN() and MAX() values for each group:

SELECT MIN(i) AS min_i, MAX(i) AS max_i
FROM (
    ...
    ) AS sub
GROUP BY group_number;

Now the data set is aggregated like this:

min_i    max_i
-------- --------
 1        5
 8       12
15       15
18       18
20       23
25       25

Now, just add string parsing: First, prefix a comma and a space. Then, depending on if the range spans just a single member or not, we can choose to present it as “a-b” or just “a” (rather than the ugly “a-a”).

SELECT ', '+CAST(min_i AS varchar(10))+
       ISNULL('-'+CAST(NULLIF(max_i, min_i) AS varchar(10)), '')
FROM (
    ...
    ) AS sub;

This returns human-readable ranges, one range for every row:

, 1-5
, 8-12
, 15
, 18
, 20-23
, 25

Getting all those rows on a single line

Finally, we’re going to wrap all this with a neat pattern using SQL Server’s built-in XML parsing to put all these ranges on a single, comma-separated line.

SELECT SUBSTRING(CAST((
    ...
    ORDER BY min_i
    FOR XML PATH(''), TYPE
    ) AS varchar(max)), 3, 10000);

You can read more about this pattern in my post on a LISTAGG equivalent for SQL Server. Here’s the output:

1-5, 8-12, 15, 18, 20-23, 25

The complete code

Remember: read this from the inside out.

--- Turning XML into a string:
SELECT SUBSTRING(CAST((
    --- Turning rows into XML:
    SELECT ', '+CAST(min_i AS varchar(10))+
           ISNULL('-'+CAST(NULLIF(max_i, min_i) AS varchar(10)), '')
    FROM (
        --- MIN() and MAX() for each series
        SELECT MIN(i) AS min_i, MAX(i) AS max_i
        FROM (
            --- Running total of "new series" bit:
            SELECT i, SUM(is_new_series) OVER (
                   ORDER BY i ROWS UNBOUNDED PRECEDING) AS group_number
            FROM (
                --- Compare previous i to current i:
                SELECT i, (CASE WHEN LAG(i, 1) OVER (ORDER BY i)=i-1
                           THEN 0
                           ELSE 1 END) AS is_new_series
                FROM #integers
                ) AS sub
            ) AS sub
        GROUP BY group_number
        ) AS sub
    ORDER BY min_i
    FOR XML PATH(''), TYPE
    ) AS varchar(max)), 3, 10000);

The same thing with dates

You can use this pattern for any data type with discrete values; integers, dates, even characters in theory. Here’s the exact same code for date ranges. The only difference is that instead of “-1”, we use DATEADD() to subtract one day.

--- Turning XML into a string:
SELECT SUBSTRING(CAST((
    --- Turning rows into XML:
    SELECT ', '+CONVERT(varchar(10), min_date, 121)+
           ISNULL(' to '+CONVERT(varchar(10), NULLIF(max_date, min_date), 121), '')
    FROM (
        --- MIN() and MAX() for each series
        SELECT MIN([date]) AS min_date, MAX([date]) AS max_date
        FROM (
            --- Running total of "new series" bit:
            SELECT [date], SUM(is_new_series) OVER (
                   ORDER BY [date] ROWS UNBOUNDED PRECEDING) AS group_number
            FROM (
                --- Compare previous i to current i:
                SELECT [date], (CASE WHEN LAG([date], 1) OVER (ORDER BY [date])=DATEADD(day, -1, [date])
                                THEN 0
                                ELSE 1 END) AS is_new_series
                FROM #dates
                ) AS sub
            ) AS sub
        GROUP BY group_number
        ) AS sub
    ORDER BY min_date
    FOR XML PATH(''), TYPE
    ) AS varchar(max)), 3, 10000);

What about performance?

Right out of the box, the plan contains two Sort operators that I would want to eliminate to make it perform its best. The reason sorts are expensive is because they need a memory grant and they’re what’s known as “blocking”, which means they’re actually holding up the data stream.

You can clearly see the two sorts in red markings:

Pretty integer ranges 1

Pretty integer ranges 2

Pretty integer ranges 3

There are ways to tune this query, and they mainly involve eliminating the Sort operators. The first Sort is a result of the GROUP BY, the second Sort actually just reverts the stream to its original sort order.

But I’m sure this will be a great topic for another post. Stay tuned!

Let me hear your thoughts!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s