Aggregate string concatenations

A common reporting scenario is that your report has to concatenate an aggregate of string values from rows. Many other database platforms even have built-in aggregate functions that will concatenate text for you (like LISTAGG() on Oracle). In this post, we’ll take a look at how you can achieve the same results in T-SQL, using the APPLY operator and the XML datatype.

The following query in the AdventureWorks database illustrates what we’re trying to do.

SELECT cust.CustomerID, hdr.SalesOrderNumber
FROM Sales.Customer AS cust
LEFT JOIN Sales.SalesOrderHeader AS hdr ON
    cust.CustomerID=hdr.CustomerID;

For customers that have more than one SalesOrderNumber, we want to create a list of SalesOrderNumbers, so the output always contains one row per customer.

The solution we’re going to look at today is XML based. Because SQL Server can convert multiple rows of data into an XML set (which, effectively, is a long string of text), this method has the potential to be very efficient, compared to doing a lot of string parsing in recursive/iterative code.

The following code uses CROSS APPLY to generate an XML document for each CustomerID. Each XML document, in turns, contains one element for each row in the SalesOrderHeader table.

SELECT cust.CustomerID, list.*
FROM Sales.Customer AS cust
CROSS APPLY (
    --- This code generates an XML document for each
    --- row in "cust":
    SELECT hdr.SalesOrderNumber AS ListItem
    FROM Sales.SalesOrderHeader AS hdr
    WHERE hdr.CustomerID=cust.CustomerID
    ORDER BY hdr.SalesOrderNumber
    FOR XML PATH(''), TYPE
    ) AS list(xmlDoc);

Also, note the ORDER BY clause in the FOR XML subquery, ensuring proper ordering of the elements. Here’s a sample XML output for one customer:

<ListItem>SO51859</ListItem>
<ListItem>SO57167</ListItem>
<ListItem>SO63278</ListItem>
<ListItem>SO69530</ListItem>

(Line breaks added by author for readability)

Now, let’s add a comma to each item, so we can tell them apart later on, when we concatenate them:

SELECT cust.CustomerID, list.*
FROM Sales.Customer AS cust
CROSS APPLY (
    --- This code generates an XML document for each
    --- row in "cust":
    SELECT ', '+hdr.SalesOrderNumber AS ListItem
    FROM Sales.SalesOrderHeader AS hdr
    WHERE hdr.CustomerID=cust.CustomerID
    ORDER BY hdr.SalesOrderNumber
    FOR XML PATH(''), TYPE
    ) AS list(xmlDoc);

Which yields the following XML result:

<ListItem>, SO51859</ListItem>
<ListItem>, SO57167</ListItem>
<ListItem>, SO63278</ListItem>
<ListItem>, SO69530</ListItem>

Finally, we’ll extract this element from the XML and return it in the query as a varchar. When we do this, we could extract the individual ListItem elements and then do some string parsing, but the value() function allows us to retrieve all elements in a single concatenated string instead, by specifying the XPath “.” (which basically means “here”). The end product looks like this:

SELECT cust.CustomerID,
       SUBSTRING(
           list.xmlDoc.value('.', 'varchar(max)'),
           3, 10000) AS SalesOrderNumbers
FROM Sales.Customer AS cust
CROSS APPLY (
    --- This code generates an XML document for each
    --- row in "cust":
    SELECT ', '+hdr.SalesOrderNumber AS ListItem
    FROM Sales.SalesOrderHeader AS hdr
    WHERE hdr.CustomerID=cust.CustomerID
    ORDER BY hdr.SalesOrderNumber
    FOR XML PATH(''), TYPE
    ) AS list(xmlDoc);

The SUBSTRING() construct is used to trim away the leading comma in the first element – the length argument (10 000) is just arbitrary, to get the “remainder” of the string. If you expect long strings, you should probably use the LEN() of the string here instead.

And, to be fair, you could solve this without the CROSS APPLY, but it’s a pretty coding pattern in my opinion.

Performance

XML encoding, used like this, is a very economical alternative to building a recursive common table expression or similar solutions, because SQL Server has built-in support to do these concatenations, and that keeps the complexity of the query on a manageable level. Out of curiosity, a recursive CTE that produces the same results wouldn’t complete for minutes on my server, whereas the above query completes in less than a second with 20 000 rows.

When you look at their respective estimated query plans, you can clearly see why. First, the plan for the XML and APPLY-based query:

Aggregate concatenation queryplan, XML-based

And, for comparison, the one generated by the recursive common table expression.

Aggregate concatenation queryplan, rCTE-based

For more articles like this one, make sure to check out the rest of sqlsunday.com, and remember to “like” the Facebook page to get notifications of new posts!

8 comments

    1. If the SalesOrderNumber column is NULL, you could use the ISNULL() function. In my post, I try to keep the code as simple as possible for readability.

  1. Thanks! Just what I needed.
    By the way, instead of using SUBSTRING to get rid of the leading delimiter you can use the STUFF function. That way you avoid calcuating the LEN of your value.

    STUFF(
    list.xmlDoc.value(‘.’, ‘varchar(max)’),
    1,2,”) AS SalesOrderNumbers

Leave a comment

Your email address will not be published. Required fields are marked *