Calculating median using the NTILE function

A while back, I was shown an absolutely gorgeous median calculation, using the new OFFSET/FETCH functionality introduced in SQL Server 2012. But this got me thinking. How can you calculate the median in an older SQL Server? Here’s an idea of how to do it using the NTILE() function.

A median function

The median is the “middle value” in a series. Not to be confused with averages, the median is simply the value in the middle of the series. If you have an odd number of values, this is straight-forward, but if you have an even number of values, there is no exact median value, so the mean value of the two middle values is used.

Example: The median of {1, 3, 7, 11, 12} is 7. The median of {1, 2, 3, 4, 12, 13} is 3.5, i.e. (3+4)/2.

NTILE(n)

The NTILE() function in T-SQL is a window function that calculates the n-tile of a value within a series of values. NTILE(100), for instance, will return the percentile of the value, where 1 are the bottom 1% values and 100 are the top 1% values. NTILE(4) returns the quartile, and so on.

When using NTILE(2), every row in the result set is either part of the “lower half”, NTILE(2)=1, or the “upper half”, NTILE(2)=2, of the partition.

It’s important to know that NTILE(n) can only distribute values evenly across all tiles if the number of values is evenly divisible by n. The remainder is distributed from tile 1 and up.

Example: Calculating NTILE(4) on the following values:

{1, 3, 4, 6, 9, 11}

.. will return the following results:

NTILE(4)=1   {1, 3}
NTILE(4)=2   {4, 6}
NTILE(4)=3   {9}
NTILE(4)=4   {11}

So, with NTILE(2), the first tile will contain the same number of rows as the second tile, but only if the total number of rows is even. Otherwise, the first tile will contain one row more than the second. See where we’re going with this?

Calculating the median

In our case, it boils down to the following pseudo code:

  • Partition the data, in our case on CustomerID.
  • Assign each value to “tile” 1 or 2.
  • For an odd total number of rows, return highest value in “tile” number 1.
  • For an even total number of rows, return (the highest value in “tile” 1 + the lowest value in “tile” 2) / 2.

Here’s the first dataset where we calculate the NTILE(2) of the total sales amount. For the NTILE() window function, we’re partitioning by CustomerID, so we get one calculation for each customer. Because the NTILE() function is ordered, obviously, we also need to order by TotalDue.

SELECT CustomerID, SalesOrderNumber, TotalDue,
    NTILE(2) OVER (
        PARTITION BY CustomerID
        ORDER BY TotalDue) AS ntile2
FROM Sales.SalesOrderHeader;

Putting this in a CTE (“orders”), we can aggregate the MAX() and MIN() of the first and second “tile” respectively, grouping the output by CustomerID.

WITH orders (CustomerID, SalesOrderNumber, TotalDue, ntile2)
AS (
    SELECT CustomerID, SalesOrderNumber, TotalDue,
        NTILE(2) OVER (
            PARTITION BY CustomerID
            ORDER BY TotalDue) AS ntile2
    FROM Sales.SalesOrderHeader)

SELECT CustomerID,
    --- Row count for CustomerID:
    COUNT(*) AS [count],
    --- Highest value of tile 1:
    MAX((CASE WHEN ntile2=1 THEN TotalDue END)) AS max_ntile1,
    --- Lowest value of tile 2:
    MIN((CASE WHEN ntile2=2 THEN TotalDue END)) AS min_ntile2
FROM orders
GROUP BY CustomerID;

Now we can calculate the median value of the TotalDue column, based on the row count (the “count” column), the highest value in tile 1 (“max_ntile1”) and the lowest value in tile 2 (“min_ntile2”).

WITH orders (CustomerID, SalesOrderNumber, TotalDue, ntile2)
AS (
    SELECT CustomerID, SalesOrderNumber, TotalDue,
        NTILE(2) OVER (
            PARTITION BY CustomerID
            ORDER BY TotalDue) AS ntile2
    FROM Sales.SalesOrderHeader),

     orders2 (CustomerID, [count], max_ntile1, min_ntile2)
AS (
    SELECT CustomerID,
        --- Row count for CustomerID:
        COUNT(*) AS [count],
        --- Highest value of tile 1:
        MAX((CASE WHEN ntile2=1 THEN TotalDue END)) AS max_ntile1,
        --- Lowest value of tile 2:
        MIN((CASE WHEN ntile2=2 THEN TotalDue END)) AS min_ntile2
    FROM orders
    GROUP BY CustomerID)

SELECT CustomerID, (CASE [count]%2
    --- Odd number of rows
    WHEN 1 THEN max_ntile1
    --- Even number of rows
    ELSE (max_ntile1+min_ntile2)/2.0 END) AS median_TotalDue
FROM orders2
ORDER BY CustomerID;

You don’t really need the second CTE; I just put it there for readability.

Thanks for reading – check back next week for more, and remember to like the Facebook page to get regular updates on your feed!

Let me hear your thoughts!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s