Know your collation orders

We’ve discussed earlier on the effect of using SQL wildcards, such as ‘%[a-z]%’ with different collation orders, particularly case sensitive ones. Here’s another take on collations, and how different collation orders can sort text differently.

The SQL Server collation order, like the name implies, determines the sort order of ordered sets and how comparisons between data are performed, i.e. which two characters/strings are alike, and which aren’t. If, for instance, you’re german-speaking, you may want the two following texts to be equal:

  • “meine Füße”
  • “meine Fuesse”

This is because, in german, the double-S (ß) can roughly be substituted with “ss”, and the “ü” with “ue”. You can verify this with a simple SQL query that uses collations to compare the two strings:

SELECT
    (CASE WHEN 'meine Füße' COLLATE Latin1_General_CI_AS
        ='meine Fuesse' COLLATE Latin1_General_CI_AS
            THEN 'Yep' ELSE 'Nope' END) AS in_latin1,

    (CASE WHEN 'meine Füße' COLLATE German_PhoneBook_CI_AS
        ='meine Fuesse' COLLATE German_PhoneBook_CI_AS
            THEN 'Yep' ELSE 'Nope' END) AS in_german;

Notice that the two strings are equal when using a german collation, but not with most others.

Here’s a query that gives you a quick overview in the form of an ASCII table where all the characters are ordered by a few common collations. The first common table expression, “all_ascii_chars” is a recursive CTE that builds a set with all characters from CHAR(1) to CHAR(255). I’ve UNION’ed this set with a few manual additions of my own in the second CTE, “list_with_extra_chars”, below. Finally, DENSE_RANK() is used to give each character an ordinal for each respective collation order.

WITH all_ascii_chars ([ascii], c, is_anchor)
AS (
    --- Start with CHAR(1) as the anchor..
    SELECT 1 AS [ascii], CHAR(1) AS c, 1 AS is_anchor
    UNION ALL
    --- ... and add another character for every ASCII code
    --- up to 255 with the recursive part of the common
    --- table expression:
    SELECT [ascii]+1, CHAR([ascii]+1) AS c, 1 AS is_anchor
    FROM all_ascii_chars
    WHERE is_anchor=1 AND [ascii]<255),

     list_with_extra_chars (c)
AS (
    --- The above list of characters (but only lower-case A-Z)..
    SELECT CAST(c AS varchar(10))
    FROM all_ascii_chars
    WHERE c COLLATE Finnish_Swedish_CS_AI LIKE '[a-zåäöü]' AND
        c COLLATE Finnish_Swedish_CS_AI=
            LOWER(c COLLATE Finnish_Swedish_CS_AI)

    --- unioned with our own "special characters":    
    UNION ALL
    SELECT CAST('aa' AS varchar(10)) AS c UNION ALL
    SELECT CAST('ss' AS varchar(10)) AS c UNION ALL
    SELECT CAST('ae' AS varchar(10)) AS c UNION ALL
    SELECT CAST('oe' AS varchar(10)) AS c UNION ALL
    SELECT CAST('ue' AS varchar(10)) AS c)

--- And compile them all, displaying different collation orders
--- using DENSE_RANK to visualize the differences:
SELECT c,
    DENSE_RANK() OVER (ORDER BY c COLLATE German_PhoneBook_CI_AS)
        AS German_PhoneBook_CI_AS,
    DENSE_RANK() OVER (ORDER BY c COLLATE Finnish_Swedish_CI_AS)
        AS Finnish_Swedish_CI_AS,
    DENSE_RANK() OVER (ORDER BY c COLLATE Danish_Norwegian_CI_AS)
        AS Danish_Norwegian_CI_AS,
    DENSE_RANK() OVER (ORDER BY c COLLATE French_CI_AS)
        AS French_CI_AS,
    DENSE_RANK() OVER (ORDER BY c COLLATE Latin1_General_CI_AS)
        AS Latin1_General_CI_AS
FROM list_with_extra_chars
--- Sort the table by the Latin1_General column
ORDER BY 6
OPTION (MAXRECURSION 0);

The result of this query will hopefully illustrate the importance of keeping track of your collations. Finally, if you want a comprehensive list of all collations available, use the following T-SQL query:

SELECT * FROM sys.fn_helpcollations();

Let me hear your thoughts!

This site uses Akismet to reduce spam. Learn how your comment data is processed.