The other day, I ran across the following issue: I was looking to split a value using upper case and lower case characters. Here’s an example query:
SELECT PATINDEX('%[a-z][A-Z]%', 'testValue' COLLATE Finnish_Swedish_CS_AS) AS offset
I expected this query to return the value 4, because at that offset in the string, there is a lower case character (t) followed directly by an upper case character (V). However, in practice, this query returns the value 1.
After doing some googling, the problem breaks down to how different collations work in wildcards using intervals (such as a-z or A-Z). Here’s a more specific test to prove what happens:
SELECT (CASE WHEN 'm' COLLATE Finnish_Swedish_CS_AS LIKE '[A-Z]' THEN 'Yes' ELSE 'No' END);
This query will return “yes”, even though m is not an upper case character. The following query, however, will not:
SELECT (CASE WHEN 'm' COLLATE Finnish_Swedish_CS_AS LIKE '[ABCDEFGHIJKLMNOPQRSTUVXYZ]' THEN 'Yes' ELSE 'No' END);
So what’s the difference? [A-Z] is not the same thing as [ABCD….XYZ] if you’re using a case sensitive collation order. Sure, the collation order differentiates between upper case and lower case characters, but it sorts them as a, A, b, B, c, C, …, z, Z. And the sort order is how wildcard intervals like [A-Z] work.
Sorting order with different types of collations
Here’s a practical overview of how different collations sort character values:
WITH characters (ch) AS ( SELECT 'a' UNION ALL SELECT 'A' UNION ALL SELECT 'b' UNION ALL SELECT 'B' UNION ALL SELECT 'c' UNION ALL SELECT 'C') SELECT ch, ASCII(ch) AS [ascii], DENSE_RANK() OVER (ORDER BY ch COLLATE Finnish_Swedish_CI_AS) AS ci, DENSE_RANK() OVER (ORDER BY ch COLLATE Finnish_Swedish_CS_AS) AS cs, DENSE_RANK() OVER (ORDER BY ch COLLATE Finnish_Swedish_BIN) AS bin FROM characters ORDER BY 2;
Note how the case sensitive sort order will order the characters a, A, b, B, c, C, wheras the binary sort order will order the characters by their ASCII codes, A, B, C, a, b, c.
Solution
The solution to the problem is to use a binary sort order when using [A-Z] and [a-z] wildcards.
SELECT PATINDEX('%[a-z][A-Z]%', 'testValue' COLLATE Finnish_Swedish_BIN) AS offset
Now, the query will return the value 4.
1 comment