Collations and case sensitive wildcards

The other day, I ran across the following issue: I was looking to split a value using upper case and lower case characters. Here’s an example query:

SELECT PATINDEX('%[a-z][A-Z]%',
    'testValue' COLLATE Finnish_Swedish_CS_AS) AS offset

I expected this query to return the value 4, because at that offset in the string, there is a lower case character (t) followed directly by an upper case character (V). However, in practice, this query returns the value 1.

After doing some googling, the problem breaks down to how different collations work in wildcards using intervals (such as a-z or A-Z). Here’s a more specific test to prove what happens:

SELECT (CASE
    WHEN 'm' COLLATE Finnish_Swedish_CS_AS
        LIKE '[A-Z]'
    THEN 'Yes' ELSE 'No' END);

This query will return “yes”, even though m is not an upper case character. The following query, however, will not:

SELECT (CASE
    WHEN 'm' COLLATE Finnish_Swedish_CS_AS
        LIKE '[ABCDEFGHIJKLMNOPQRSTUVXYZ]'
    THEN 'Yes' ELSE 'No' END);

So what’s the difference? [A-Z] is not the same thing as [ABCD….XYZ] if you’re using a case sensitive collation order. Sure, the collation order differentiates between upper case and lower case characters, but it sorts them as a, A, b, B, c, C, …, z, Z. And the sort order is how wildcard intervals like [A-Z] work.

Sorting order with different types of collations

Here’s a practical overview of how different collations sort character values:

WITH characters (ch)
AS (
    SELECT 'a' UNION ALL
    SELECT 'A' UNION ALL
    SELECT 'b' UNION ALL
    SELECT 'B' UNION ALL
    SELECT 'c' UNION ALL
    SELECT 'C')

SELECT ch,
    ASCII(ch) AS [ascii],
    DENSE_RANK() OVER (ORDER BY ch COLLATE Finnish_Swedish_CI_AS) AS ci,
    DENSE_RANK() OVER (ORDER BY ch COLLATE Finnish_Swedish_CS_AS) AS cs,
    DENSE_RANK() OVER (ORDER BY ch COLLATE Finnish_Swedish_BIN) AS bin
FROM characters
ORDER BY 2;

Note how the case sensitive sort order will order the characters a, A, b, B, c, C, wheras the binary sort order will order the characters by their ASCII codes, A, B, C, a, b, c.

Solution

The solution to the problem is to use a binary sort order when using [A-Z] and [a-z] wildcards.

SELECT PATINDEX('%[a-z][A-Z]%',
    'testValue' COLLATE Finnish_Swedish_BIN) AS offset

Now, the query will return the value 4.

1 comment

Leave a comment

Your email address will not be published. Required fields are marked *