Creating a top-10 list and aggregating the remainder

This article will show you how to create a “top 10” list of data, as well as providing an aggregate on the remainder of the data. There are a few different approaches to this.

The example queries in this post are all built on the AdventureWorks database, but you can easily adapt them to pretty much any other table you like.

The simple TOP 10 list

The simplest query is a regular SELECT query with a TOP keyword, but it won’t provide the “remainder” aggregate that we’re looking for:

SELECT TOP 10 SalesOrderID, SalesOrderNumber, SubTotal
FROM Sales.SalesOrderHeader
ORDER BY SubTotal DESC;

Using a common table expression and windowed functions

If you use the above query within a common table expression, you can write a UNION of two queries; one with the top 10 list, and the other with the aggregate of the remaining rows.

The ordinal number (1st place, 2nd place, etc) in the common table expression is calculated using a ROW_NUMBER() windowed function. If you’re using GROUP BY in the CTE, you might have to change ROW_NUMBER() to DENSE_RANK().

WITH topHeaders (SalesOrderNumber, SubTotal, Ordinal)
AS (--- Give each row an ordinal number
    SELECT SalesOrderNumber, SubTotal,
           ROW_NUMBER() OVER (ORDER BY SubTotal DESC) AS Ordinal
    FROM Sales.SalesOrderHeader)

--- The top 10 rows:
SELECT SalesOrderNumber, SubTotal, Ordinal
FROM topHeaders
WHERE Ordinal<=10
UNION ALL
--- .. and the remaining aggregate
SELECT 'Others', SUM(SubTotal), NULL
FROM topHeaders
WHERE Ordinal>10;

This query might not be optimal if your source data contains a lot of rows, because it will enumerate the entire table with ordinal numbers. There are, however, few simple solutions to this performance problem.

Using a temp table to optimize performance

With large quantities of data, using a temp table could actually lead to a query plan that is more optimized for performance. This is not a given, because it may vary from case to case. As usual, use the estimated query plan to see which solution is best for you.

--- This is the temp table we're going to be using:
CREATE TABLE #top10 (
    SalesOrderNumber   nvarchar(50) NOT NULL,
    SubTotal           money NOT NULL,
    Ordinal            int NOT NULL,
    PRIMARY KEY CLUSTERED (Ordinal)
);

--- Adding the top 10 rows into the temp table:
INSERT INTO #top10 (SalesOrderNumber, SubTotal, Ordinal)
SELECT TOP 10 SalesOrderNumber, SubTotal,
    ROW_NUMBER() OVER (ORDER BY SubTotal DESC) AS Ordinal
FROM Sales.SalesOrderHeader
ORDER BY SubTotal DESC;

--- Rows from the #top10 temp table:
SELECT SalesOrderNumber, SubTotal, Ordinal
FROM #top10
UNION ALL
--- .. and the remainder:
SELECT NULL,
    --- The remainder is the grand total of the base table,
    --- minus the sum of the top 10.
    (SELECT SUM(SubTotal) FROM Sales.SalesOrderHeader)-
    (SELECT SUM(SubTotal) FROM #top10) AS SubTotal, NULL;

DROP TABLE #top10;

The main difference between this query and the previous one is that when populating the temp table, we’re using the TOP operator along with ORDER BY to limit the number of rows that we calculate the ROW_NUMBER() on. This should, in theory, reduce the workload for the SQL Server when there is a large volume of rows in the table.

1 comment

Leave a comment Cancel reply