A while back, I was shown an absolutely gorgeous median calculation, using the new OFFSET/FETCH functionality introduced in SQL Server 2012. But this got me thinking. How can you calculate the median in an older SQL Server? Here’s an idea of how to do it using the NTILE() function.
A median function
The median is the “middle value” in a series. Not to be confused with averages, the median is simply the value in the middle of the series. If you have an odd number of values, this is straight-forward, but if you have an even number of values, there is no exact median value, so the mean value of the two middle values is used.
Example: The median of {1, 3, 7, 11, 12} is 7. The median of {1, 2, 3, 4, 12, 13} is 3.5, i.e. (3+4)/2.
NTILE(n)
The NTILE() function in T-SQL is a window function that calculates the n-tile of a value within a series of values. NTILE(100), for instance, will return the percentile of the value, where 1 are the bottom 1% values and 100 are the top 1% values. NTILE(4) returns the quartile, and so on.
When using NTILE(2), every row in the result set is either part of the “lower half”, NTILE(2)=1, or the “upper half”, NTILE(2)=2, of the partition.
It’s important to know that NTILE(n) can only distribute values evenly across all tiles if the number of values is evenly divisible by n. The remainder is distributed from tile 1 and up.
Example: Calculating NTILE(4) on the following values:
{1, 3, 4, 6, 9, 11}
.. will return the following results:
NTILE(4)=1 {1, 3} NTILE(4)=2 {4, 6} NTILE(4)=3 {9} NTILE(4)=4 {11}
So, with NTILE(2), the first tile will contain the same number of rows as the second tile, but only if the total number of rows is even. Otherwise, the first tile will contain one row more than the second. See where we’re going with this?
Calculating the median
In our case, it boils down to the following pseudo code:
- Partition the data, in our case on CustomerID.
- Assign each value to “tile” 1 or 2.
- For an odd total number of rows, return highest value in “tile” number 1.
- For an even total number of rows, return (the highest value in “tile” 1 + the lowest value in “tile” 2) / 2.
Here’s the first dataset where we calculate the NTILE(2) of the total sales amount. For the NTILE() window function, we’re partitioning by CustomerID, so we get one calculation for each customer. Because the NTILE() function is ordered, obviously, we also need to order by TotalDue.
SELECT CustomerID, SalesOrderNumber, TotalDue, NTILE(2) OVER ( PARTITION BY CustomerID ORDER BY TotalDue) AS ntile2 FROM Sales.SalesOrderHeader;
Putting this in a CTE (“orders”), we can aggregate the MAX() and MIN() of the first and second “tile” respectively, grouping the output by CustomerID.
WITH orders (CustomerID, SalesOrderNumber, TotalDue, ntile2) AS ( SELECT CustomerID, SalesOrderNumber, TotalDue, NTILE(2) OVER ( PARTITION BY CustomerID ORDER BY TotalDue) AS ntile2 FROM Sales.SalesOrderHeader) SELECT CustomerID, --- Row count for CustomerID: COUNT(*) AS [count], --- Highest value of tile 1: MAX((CASE WHEN ntile2=1 THEN TotalDue END)) AS max_ntile1, --- Lowest value of tile 2: MIN((CASE WHEN ntile2=2 THEN TotalDue END)) AS min_ntile2 FROM orders GROUP BY CustomerID;
Now we can calculate the median value of the TotalDue column, based on the row count (the “count” column), the highest value in tile 1 (“max_ntile1”) and the lowest value in tile 2 (“min_ntile2”).
WITH orders (CustomerID, SalesOrderNumber, TotalDue, ntile2) AS ( SELECT CustomerID, SalesOrderNumber, TotalDue, NTILE(2) OVER ( PARTITION BY CustomerID ORDER BY TotalDue) AS ntile2 FROM Sales.SalesOrderHeader), orders2 (CustomerID, [count], max_ntile1, min_ntile2) AS ( SELECT CustomerID, --- Row count for CustomerID: COUNT(*) AS [count], --- Highest value of tile 1: MAX((CASE WHEN ntile2=1 THEN TotalDue END)) AS max_ntile1, --- Lowest value of tile 2: MIN((CASE WHEN ntile2=2 THEN TotalDue END)) AS min_ntile2 FROM orders GROUP BY CustomerID) SELECT CustomerID, (CASE [count]%2 --- Odd number of rows WHEN 1 THEN max_ntile1 --- Even number of rows ELSE (max_ntile1+min_ntile2)/2.0 END) AS median_TotalDue FROM orders2 ORDER BY CustomerID;
You don’t really need the second CTE; I just put it there for readability.
Thanks for reading – check back next week for more, and remember to like the Facebook page to get regular updates on your feed!
Good article but one major correction is warrented.
“NTILE(100), for instance, will return the percentile of the value, where 1 are the bottom 1% values and 100 are the top 1% values…)
This is 100% incorrect. NTILE and Percent_rank are not related in any way, they are not even members of the same family of functions. NTILE is a window ranking function and does not consider ties, Percent_Rank is a rank distribution function and will ONLY assign a duplicate value to two values when they are identical. Huge difference. Yes, there are cases where NTILE(100) will return the correct answer but its coincidental; note this article: http://www.sqlservercentral.com/articles/PERCENT_RANK/141532/
Its possible using NTILE(100), for example, to have 2 identical values have different NTILE numbers ssigned to them, and therefore, diffeeent percentile rankings. That’s just not how math works.
Thanks!