Back in 2014 I wrote a blog post on how to calculate a median value using the NTILE window function. It's far from the best-performing solution there is, but it worked on SQL Server 2008, before the introduction of OFFSET-FETCH in SQL Server. In this post, I'm going to look at creating a generalized function that calculates the median (or any percentile) of a series of values.
The math
Medians as a concept are simple enough. If you have a large number of values, like a series of statistical measurements, you want to pick the middle one. The median, as opposed to the average, is useful for a number of reasons, one of them being that it reduces the effect of so-called outlier values.
For example, in the range {1, 2, 3, 4, 5, 6, 7, 8, 9, 1000}, the average is 104.5, but that's really not an accurate reflection of the range. The reason the average is "skewed" is obviously the number 1000, known as an outlier in this range. The median is the "middle" number in the range, and because we have an even number of values in this example, the median is the average of the two middle values, {5, 6}, i.e. 5.5. This is the most common way to calculate the median of an even number of values, but not the only one. And as I'm not a mathematician, I won't go into detail on the different methods and applications.
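Just to illustrate the arithmetic in T-SQL (the @demo table variable exists purely for this example):

DECLARE @demo TABLE (val numeric(28, 8) NOT NULL);

INSERT INTO @demo (val)
VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (1000);

--- The outlier pulls the average all the way up to 104.5:
SELECT AVG(val) AS average FROM @demo;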
The function
The most elegant (and most efficient) way to calculate a median in SQL Server that I've seen so far uses the OFFSET-FETCH clause, available as of SQL Server 2012. In an ordered set, OFFSET determines how many rows to skip, and FETCH how many rows to return after that point. For example, this query will skip the first 100 rows in the defined order and return the next 10 (rows 101 through 110):
SELECT *
FROM Booking.Tickets
ORDER BY TicketID
    OFFSET 100 ROWS
    FETCH NEXT 10 ROWS ONLY;
Now imagine if all the statistical values that we want to calculate a median on are ordered, and given that we know the number of values, you could use OFFSET and FETCH to return just the middle one or two values from that set.
SELECT AVG(val)
FROM (
    SELECT t.val
    FROM @tbl AS t
    ORDER BY t.val
        OFFSET (@count-1)/2 ROWS
        FETCH NEXT 2-@count%2 ROWS ONLY
    ) AS sub;
Let’s break this query down:
- @tbl contains all of the values in the column val.
- @count is the number of rows in @tbl. We’re going to use this @count to calculate the correct offset and number of rows to return.
- (@count-1)/2 performs an integer division of (@count-1), so when (@count-1) is odd (i.e. when @count is even), the result is rounded down. When @count is 10, OFFSET (10-1)/2 rounds down to 4, and when @count is 9, (9-1)/2 is exactly 4.
- 2-@count%2 returns 2 for even values of @count and 1 for odd values of @count. Remember that when we have an even number of values, we want the average of the two middle values, whereas if @count is odd, simply a single middle value is used.
- Finally, the subquery is aggregated with an AVG() for those cases when we have two values.
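Putting it all together, here's a minimal, self-contained sketch of the pattern (the table variable and its values are just for illustration):

DECLARE @tbl TABLE (val numeric(28, 8) NOT NULL);

INSERT INTO @tbl (val)
VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (1000);

DECLARE @count int=(SELECT COUNT(*) FROM @tbl);

SELECT AVG(val) AS median
FROM (
    SELECT t.val
    FROM @tbl AS t
    ORDER BY t.val
        OFFSET (@count-1)/2 ROWS
        FETCH NEXT 2-@count%2 ROWS ONLY
    ) AS sub;

--- With 10 rows: skip 4, fetch the next 2 ({5, 6}), and AVG() returns 5.5.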
Passing a table of values to the function
The trickiest part when building your own aggregate function (which, for all practical purposes, this is) without using CLR is passing multiple records as a parameter to the function. There are two ways to accomplish this.
You could create a user-defined table type, declare a table variable from that table type, populate this variable and then pass that variable to the function. The problem with this approach is that in a query with a grouping expression, you cannot repopulate the variable for each group.
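For reference, that approach would look something like the sketch below. The Calc.ValueList type is plausible enough, but the Calc.MedianFromTable function is purely hypothetical, just to show the shape of the workaround:

--- A user-defined table type holding the values:
CREATE TYPE Calc.ValueList AS TABLE (val numeric(28, 8) NOT NULL);
GO
--- A function would take it as a READONLY table valued parameter:
--- CREATE FUNCTION Calc.MedianFromTable(@values Calc.ValueList READONLY) ...

DECLARE @v Calc.ValueList;
INSERT INTO @v (val) VALUES (1), (2), (3);
--- SELECT * FROM Calc.MedianFromTable(@v);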
To solve the repopulation problem, I'm going for an XML-based method instead. Passing all the values in an XML blob for each group is slightly less efficient, but it gives you much more latitude when writing grouped queries.
Whichever of these two approaches you choose, calling the function is going to be a bit awkward. But that's how workarounds go.
Scalar or table value function?
Another issue to tackle: should we make this a scalar function (returning a single scalar value) or a table valued function (returning a recordset, in this case with a single row and a single column)? Common wisdom has it that scalar functions and multi-statement table valued functions are terrible performers, mostly because they're called once for every single row. Inline table valued functions, on the other hand, are expanded into the calling query much like parameterized views, and can be much more efficient in certain situations.
I built the function in these three different ways and benchmarked them by calculating the median exchange rate for each currency from two tables, Currencies (with 11 rows) and CurrencyRates (about 2,000 rows per currency). All of the queries did a single scan on Currencies with just two pages, and none of them used any parallelism. I would expect the inline and multi-statement table valued functions to eventually parallelize if we ramped up the volumes significantly, because of the CROSS APPLY operation.
Here’s how they compared overall:
- Scalar function: 92 reads (11 scans) on CurrencyRates.
- Multi-statement table valued function: 92 reads (11 scans) on CurrencyRates, plus 22 logical reads (11 scans) on a temp table.
- Inline table valued function: 552 reads (66 scans) on CurrencyRates.
The inline table valued function was terrible in terms of performance. While the other two came in at about 400 ms each, the inline table valued function took a whopping 64 seconds to complete. The reason is that it's defined as a single SELECT statement, which rules out using a (properly indexed) work table and duplicates the XML parsing (once to get the values and again to count them).
Typed XML
Since we’re passing an XML argument, we want to be able to parse this XML blob as efficiently as possible, and for this purpose, I’ve created a schema collection. You could do without one, but applying the schema will dramatically improve the XML parsing performance.
The schema is very simple – a bunch of <value> elements with numeric (“double”) values.
CREATE XML SCHEMA COLLECTION Calc.ValueListSchema AS '
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="value" type="xsd:double" />
</xsd:schema>';
Here’s an example of what the XML argument could look like:
<value>1.0</value>
<value>2.0</value>
<value>3.0</value>
...
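Incidentally, this is exactly the shape you get from FOR XML PATH with an empty element name, which is also how we'll build the argument when calling the function later on. A minimal sketch, assuming the values live in a table variable:

DECLARE @tbl TABLE (val numeric(28, 8) NOT NULL);
INSERT INTO @tbl (val) VALUES (1.0), (2.0), (3.0);

--- Aliasing the column as [value] names the XML elements:
SELECT t.val AS [value]
FROM @tbl AS t
FOR XML PATH(''), TYPE;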
The median function
Here’s the finished product (in the form of a multi-statement table valued function).
CREATE FUNCTION Calc.Median(
    @values xml(Calc.ValueListSchema)=NULL
)
RETURNS @out TABLE (
    median numeric(28, 8) NOT NULL,
    PRIMARY KEY CLUSTERED (median)
)
WITH SCHEMABINDING
AS

BEGIN;
    --- Variable declaration
    DECLARE @rowcount int=0;

    --- Our work table where we're going to store all the values
    DECLARE @tbl TABLE (
        val     numeric(28, 8) NOT NULL,
        ident   int IDENTITY(1, 1) NOT NULL,
        PRIMARY KEY CLUSTERED (val, ident)
    );

    --- Extract the values from the XML blob into @tbl
    INSERT INTO @tbl (val)
    SELECT v.n.value('.', 'numeric(28, 8)') AS val
    FROM @values.nodes('/value') AS v(n);

    SET @rowcount=@@ROWCOUNT;
    IF (@rowcount=0) RETURN;

    --- Calculate the median, store the single output row in @out
    INSERT INTO @out (median)
    SELECT AVG(val)
    FROM (
        SELECT t.val
        FROM @tbl AS t
        ORDER BY t.val
            OFFSET (@rowcount-1)/2 ROWS
            FETCH NEXT 2-@rowcount%2 ROWS ONLY
        ) AS sub;

    RETURN;
END;
Note: schemabinding a function can improve performance even if it never references any objects in the database. Because SQL Server knows that this object won’t reference any so-called user data, it can sometimes eliminate Table Spool operators in the query plan, which gives you a more efficient plan.
Calling the function
Calling the function involves creating an XML blob from the argument values, then APPLY’ing the function with that XML blob as the argument. Again, the example with currencies and currency rates:
SELECT c.Currency, c.Name, m.median
FROM Finance.Currencies AS c
--- For each currency, apply this:
CROSS APPLY Calc.Median((
    --- XML element must be named "value":
    SELECT r.[Rate] AS [value]
    FROM Finance.CurrencyRates AS r
    WHERE r.CurrencyID=c.CurrencyID
    --- XML conversion happens here:
    FOR XML PATH(''), TYPE)) AS m;
The percentile function
We can easily modify the function to calculate any percentile, not just the median (the 50th percentile). Here’s how:
CREATE FUNCTION Calc.Percentile(
    @values xml(Calc.ValueListSchema)=NULL,
    @percentile int=50
)
RETURNS @out TABLE (
    percentile numeric(28, 8) NOT NULL,
    PRIMARY KEY CLUSTERED (percentile)
)
WITH SCHEMABINDING
AS

BEGIN;
    --- Variable declaration
    DECLARE @rowcount int=0;

    --- Our work table where we're going to store all the values
    DECLARE @tbl TABLE (
        val     numeric(28, 8) NOT NULL,
        ident   int IDENTITY(1, 1) NOT NULL,
        PRIMARY KEY CLUSTERED (val, ident)
    );

    --- Extract the values from the XML blob into @tbl
    INSERT INTO @tbl (val)
    SELECT v.n.value('.', 'numeric(28, 8)') AS val
    FROM @values.nodes('/value') AS v(n);

    SET @rowcount=@@ROWCOUNT;
    IF (@rowcount=0) RETURN;

    --- Calculate the percentile, store the single output row in @out
    INSERT INTO @out (percentile)
    SELECT AVG(val)
    FROM (
        SELECT t.val
        FROM @tbl AS t
        ORDER BY t.val
            OFFSET @percentile*(@rowcount-1)/100 ROWS
            FETCH NEXT 2-@rowcount%2 ROWS ONLY
        ) AS sub;

    RETURN;
END;
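Calling it works just like Calc.Median, except for the added percentile argument. For example, here's the 90th percentile (90 is just an arbitrary example value) of the exchange rates for each currency:

SELECT c.Currency, c.Name, p.percentile
FROM Finance.Currencies AS c
CROSS APPLY Calc.Percentile((
    SELECT r.[Rate] AS [value]
    FROM Finance.CurrencyRates AS r
    WHERE r.CurrencyID=c.CurrencyID
    FOR XML PATH(''), TYPE), 90) AS p;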
Acknowledgements
The brilliant OFFSET-FETCH solution to percentiles that this post is based on was first shown to me by Itzik Ben-Gan in one of his presentations. He also wrote a good post on it in SQL Server Magazine.