How to build a histogram in T-SQL

Talk to SQL Server developers or DBAs about histograms, and they’ll inevitably think of index statistics. However, a task you may encounter some day is to calculate the distribution of numbers in a table. And although there’s no quick built-in function to do this, it’s not as difficult as you may think.

The first thing we’re going to need is a table with a few “random” values to play with. Here’s one with about 100 000 rows, returning the first half of a sine curve.

CREATE TABLE #values (
    value    numeric(8, 6) NOT NULL
);

WITH n (i)
AS ( --- Recursive CTE to generate numbers from zero to PI in
     --- increments of .00001 of PI:
     SELECT CAST(0.0           AS numeric(28, 8)) AS i
     UNION ALL
     SELECT CAST(.00001*PI()+i AS numeric(28, 8)) FROM n WHERE i<PI())
--- ... calculate SIN(i) and dump the results into the temp table:
INSERT INTO #values (value)
SELECT SIN(i) FROM n
OPTION (MAXRECURSION 0);

Prerequisites

No matter which pattern or method we choose for this solution, we’re going to need a few basic parameters to work with. The following declarations and variable assignments are common to all the solutions listed further down.

DECLARE @interval numeric(38, 18), --- How "wide" each range is
        @min      numeric(38, 18), --- The lowest value in the table
        @max      numeric(38, 18), --- The highest value in the table
        --- Presentation:
        @levels   smallint=7;      --- How many levels/ranges to display

--- Check the source data for MIN() and MAX(). This
--- is a highly efficient stream aggregate.
SELECT @max=MAX(value),
       @min=MIN(value)
FROM #values;

--- Finally, calculate @interval:
SET @interval=(@max-@min)/@levels;

The @levels parameter is the number of “bars” that we want to display. Since every level/bar is of equal width, the width (stored in the @interval variable) can be calculated by dividing the difference between the highest and the lowest value by the number of levels.

Suppose, then, that you have values ranging from 50 to 120 and you want to display a histogram with 10 bars. Each bar would then be (120-50)/10 = 7 units wide. So the first bar would contain values from 50 up to (but not including) 57, the second from 57 to 64, the third from 64 to 71, and so on up to 120.
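To make the arithmetic concrete, here's a quick stand-alone sketch. The 50/120/10 numbers are just the example figures above (and the @ex* variable names are mine, chosen to avoid clashing with the declarations further down), not anything from our test data:

DECLARE @exMin      numeric(38, 18)=50,
        @exMax      numeric(38, 18)=120,
        @exLevels   smallint=10,
        @exInterval numeric(38, 18);

--- (120-50)/10 = 7:
SET @exInterval=(@exMax-@exMin)/@exLevels;

--- The first three bars: 50-57, 57-64 and 64-71:
SELECT @exMin               AS bar1_from, @exMin+@exInterval   AS bar1_to,
       @exMin+@exInterval   AS bar2_from, @exMin+2*@exInterval AS bar2_to,
       @exMin+2*@exInterval AS bar3_from, @exMin+3*@exInterval AS bar3_to;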

Joining with a common table expression

This is probably the easiest method to grasp. Let’s start from the top: The first thing we need is a list of all the ranges displayed in the histogram. We’ll create this list using a recursive common table expression:

WITH ranges ([range], fromValue, toValue)
AS ( --- Start with @min, increment by @interval until
     --- [range] equals @levels:
     SELECT 1 AS [range], @min AS fromValue, @min+@interval AS toValue
     UNION ALL
     SELECT [range]+1, toValue, toValue+@interval
     FROM ranges
     WHERE [range]<@levels)

SELECT * FROM ranges;

For our example data (the sine curve), this query will produce the following output:

range   fromValue    toValue
------ ---------- ----------
1      -0.0000300  0.1428314
2       0.1428314  0.2856928
3       0.2856928  0.4285542
4       0.4285542  0.5714157
5       0.5714157  0.7142771
6       0.7142771  0.8571385
7       0.8571385  0.9999999

Now, all we need to do is join the “ranges” CTE to #values, counting the number of values for each range.

WITH ranges ([range], fromValue, toValue)
AS ( SELECT 1 AS [range], @min AS fromValue, @min+@interval AS toValue
     UNION ALL
     SELECT [range]+1, toValue, toValue+@interval
     FROM ranges
     WHERE [range]<@levels)

SELECT r.fromValue,
       r.toValue,
       COUNT(v.value) AS [count]
FROM ranges AS r
LEFT JOIN #values AS v ON
    r.fromValue<=v.value AND    --- v.value falls within the range r.fromValue..
    r.toValue>v.value           --- ... and r.toValue
GROUP BY r.[range], r.fromValue, r.toValue
ORDER BY r.[range];

There’s one problem with this query, though: with this join logic, the highest value(s) in #values won’t match any row in the ranges CTE, because v.value has to be strictly less than r.toValue according to the join predicate. The solution is to give the maximum value special treatment:

FROM ranges AS r
LEFT JOIN #values AS v ON
    r.fromValue<=v.value AND    --- v.value falls within the range r.fromValue..
   (r.toValue>v.value OR        --- ... and r.toValue, or r.toValue and v.value
    r.toValue>@max-0.5*@interval AND -- are both @max.
    v.value=@max)

The @max-0.5*@interval construct above is there to eliminate rounding errors that I encountered when writing this post.

CROSS APPLY instead of JOIN

The example above, though fairly easy to read and understand, isn’t really optimized for performance. Another way to write the same query is to use CROSS APPLY (or, in this case, OUTER APPLY). This gives you exactly the same results, but the optimizer treats the query slightly differently (and often a lot better).

WITH ranges ([range], fromValue, toValue)
AS ( SELECT 1 AS [range], @min AS fromValue, @min+@interval AS toValue
     UNION ALL
     SELECT [range]+1, toValue, toValue+@interval
     FROM ranges
     WHERE [range]<@levels)

SELECT r.fromValue,
       r.toValue,
       w.[count]
FROM ranges AS r
OUTER APPLY (
    SELECT COUNT(*) AS [count]
    FROM #values AS v
    --- In a CROSS/OUTER APPLY, the WHERE clause works like
    --- the JOIN condition:
    WHERE r.fromValue<=v.value AND
         (r.toValue>v.value OR
          r.toValue>@max-0.5*@interval AND
          v.value=@max)
    ) AS w
ORDER BY r.[range];

The difference between an OUTER APPLY solution and the JOIN solution is that the APPLY pattern is designed like a “cursor” of sorts, where a subquery (the APPLY’ed set) is executed once for each row in the ranges CTE. The join predicate, accordingly, is placed inside the “subquery” in the form of a WHERE clause that references the outer table(s).

Using my test data, I found the OUTER APPLY solution above to be really fast. Performance will, however, undoubtedly vary from one set of data to another.

There is also a more mathematical approach to calculating a histogram, one that may prove better suited to your purpose.

Using FLOOR() and modulo

Imagine if all the values started at zero. You could just divide each value by @interval and you’d get a number ranging from 0 to @levels. All you have to do then is round that number down to an integer. There are two ways to do this: one uses the FLOOR() function, the other calculates the modulo. Either way, it’s practically the same operation, and the two will generate identical query plans in SQL Server. The pseudo-code looks something like this:

  • If value is @max, decrease value slightly – for instance by 0.5*@interval, so it still stays in the range.
  • Subtract @min from value – the lowest value will now be 0 and the highest will be (@max-@min).
  • Calculate one of the following two equivalent expressions:
    • @interval * FLOOR(value/@interval)
    • value - (value mod @interval)
  • Add @min to the result
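If you want to convince yourself that the FLOOR() and modulo expressions in step 3 really are equivalent, here’s a quick stand-alone check. The sample numbers are arbitrary, picked just for illustration:

DECLARE @i numeric(38, 18)=7,    --- a sample @interval
        @v numeric(38, 18)=15;   --- a sample value

--- Both expressions round @v down to the nearest multiple of @i:
SELECT @i*FLOOR(@v/@i) AS floorVersion,   --- 7*FLOOR(15/7) = 14
       @v-@v%@i        AS moduloVersion;  --- 15-(15%7)     = 14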

For the FLOOR() solution, the T-SQL query will look like this:

SELECT @min+@interval*FLOOR(value/@interval)           AS fromValue,
       @min+@interval*FLOOR(value/@interval)+@interval AS toValue,
       COUNT(*) AS [count]
FROM (
      --- If value is @max, set value to @max-0.5*@interval
      --- Then subtract @min.
      SELECT ISNULL(NULLIF(value, @max), @max-0.5*@interval)-@min AS value
      FROM #values
      ) AS sub
GROUP BY FLOOR(value/@interval)
ORDER BY FLOOR(value/@interval);

The T-SQL for the modulo solution looks very similar:

SELECT @min+(value-value%@interval)           AS fromValue,
       @min+(value-value%@interval)+@interval AS toValue,
       COUNT(*) AS [count]
FROM (
      --- If value is @max, set value to @max-0.5*@interval
      --- Then subtract @min.
      SELECT ISNULL(NULLIF(value, @max), @max-0.5*@interval)-@min AS value
      FROM #values
      ) AS sub
GROUP BY value-value%@interval
ORDER BY value-value%@interval;

Because of the simplicity of the query plan for these two patterns, performance can be exceptionally good.

Creating a generic histogram T-SQL function

Finishing off this post, I’m going to show you how to write a table value function that performs a histogram calculation using the modulo method above. In order to pass an arbitrary number of values into a function, we’ll first need to define a user-defined table type to hold the values:

CREATE TYPE dbo.valueArray AS TABLE (
    value     numeric(38, 18) NOT NULL
);

And here’s the finished function:

CREATE FUNCTION dbo.fn_histogram (
    @values  dbo.valueArray READONLY,
    @levels  smallint=10
)
RETURNS @output TABLE (
    lowerBound     numeric(38, 18) NOT NULL,
    upperBound     numeric(38, 18) NOT NULL,
    [count]        int NOT NULL,
    PRIMARY KEY CLUSTERED (lowerBound)
)
WITH SCHEMABINDING
AS

BEGIN;
    --- Declare and set the work variables:
    DECLARE @interval numeric(38, 18),
            @min      numeric(38, 18),
            @max      numeric(38, 18);

    SELECT @min=MIN(value),
           @max=MAX(value)
    FROM @values;

    SET @interval=(@max-@min)/@levels;

    --- Calculate the histogram, using modulo, and
    --- insert the result into @output:
    INSERT INTO @output (lowerBound, upperBound, [count])
    SELECT @min+(value-value%@interval)           AS lowerBound,
           @min+(value-value%@interval)+@interval AS upperBound,
           COUNT(*)                               AS [count]
    FROM (
          --- If value is @max, set value to @max-0.5*@interval
          --- Then subtract @min.
          SELECT ISNULL(NULLIF(value, @max), @max-0.5*@interval)-@min AS value
          FROM @values
          ) AS sub
    GROUP BY value-value%@interval;

    --- ... and done:
    RETURN;
END;

SCHEMABINDING helps the query processor to further optimize the solution for performance. I’ve deliberately left out a clustered index on the table type, because this operation doesn’t benefit from ordered data (in fact, the sort operation might even slow things down). And everything we do on the table is a start-to-finish full scan of @values anyway.

To test the function, try this:

--- Declare a table variable using the user-defined table type:
DECLARE @test dbo.valueArray;

--- Fill @test with some data:
WITH n (i)
AS ( SELECT CAST(0.0 AS numeric(28, 8)) AS i UNION ALL
     SELECT CAST(.00001*PI()+i AS numeric(28, 8)) FROM n WHERE i<PI())

INSERT INTO @test (value)
SELECT SIN(i) FROM n
OPTION (MAXRECURSION 0);

--- .. and call the histogram function:
SELECT *
FROM dbo.fn_histogram(@test, 16);

If you want the fancy bar chart as well, you can try this. ;)

SELECT lowerBound, upperBound, [count],
       --- Percent of total row count:
       STR(100.0*[count]/SUM([count]) OVER (PARTITION BY NULL), 5, 2)+'%' AS [percent],
       --- Bar diagram, width between 0 and 50 characters:
       REPLICATE(':', 50*[count]/MAX([count]) OVER (PARTITION BY NULL)) AS barChart
FROM dbo.fn_histogram(@test, 16);

The REPLICATE() function has nothing to do with SQL Server replication; it simply repeats a string a given number of times. In this case, we’re using it to draw a horizontal bar for each row of the histogram.
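For instance, this one-liner shows what REPLICATE() does on its own:

SELECT REPLICATE(':', 5) AS bar;   --- returns ':::::'

In the query above, 50*[count]/MAX([count]) OVER (PARTITION BY NULL) scales the repeat count so the most populated range gets a bar 50 characters wide.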

Check back next week for more T-SQL goodies, and don’t forget to leave your feedback in the comments section below, and like Sunday Morning T-SQL on Facebook!
