A client asked me about SQL Server collations and whether they should consider the UTF-8 collations that are new in SQL Server 2019. I tried to hide my blank stare of ignorance, and promised them I’d look it up and get back to them.
Not gonna lie, I think UTF and Unicode can be pretty confusing at times, so I did some googling and some testing, and here’s what I found.
What’s the deal?
EDIT: The comment section adds a lot of useful in-depth information and corrections to some errors I’ve made in the post.
Ages ago, humanity defined (for the purposes of computing, at least) that 8 bits would be just the right size to store any character you could think of in an electronic medium. Well, actually, it used to be just 7 bits at some point, but I digress. 8 bits gives you the possibility to construct any one of 2^8 = 256 characters in a byte. Using the ASCII table, we could translate any character to an integer value, or vice versa.
Did I say “any character”? I meant to say character in a small subset of western languages, because surely nobody in Japan would be using computers, right? In hindsight, this was a pretty dumb omission, albeit every byte was precious in the years before cloud computing.
The bolt-on solution to fix this omission was “code pages”. Essentially, those 256 different character codes could be defined to mean different things depending on how you read them. You see where this is going, right?
In the end, some committee finally figured out that we could use more than 8 bits to store text values. I know, right? But turning every single piece of text on the planet into 32-bit blocks was not really a practical idea for a number of reasons, least of which was storage. So they devised a way to declare a character using 8, 16, 24 or 32 bits, depending on how “exotic” it is – this is what’s known as a “variable-length encoding”. This is UTF-8.
Some definitions
At this point, I need to define some words:
- A “character” is one letter in the alphabet, one emoji or one kanji, depending on your writing preference.
- A “byte” is the basic storage unit of 8 bits.
- “ASCII” is the legacy character set – strictly speaking a 7-bit encoding, commonly extended into various 8-bit code pages.
- “Unicode” is the standardized character set that encompasses all (current) languages and writing styles.
There are three common UTF encodings (all of the same Unicode character set):
- UTF-8 is variable-length. Common western characters are 8 bits, but a character can take anything between 8 and 32 bits (one to four bytes).
- UTF-16 uses 16-bit units: most characters take a single unit (two bytes), but supplementary characters – emoji, for instance – take two units (four bytes).
- UTF-32 is fixed-width: always 32 bits per character, but not very common.
SQL Server’s nvarchar datatype uses UTF-16, and so does Windows. It’s easy and predictable, but it does take up a bit of storage and RAM.
varchar and nvarchar in SQL Server
SQL Server can store string values in varchar or nvarchar, where varchar stores each character as 1 byte (with a few double-byte code page exceptions), and nvarchar uses UTF-16 – 2 bytes per character, 4 for supplementary characters such as emoji.
As a rule of thumb, you should use nvarchar to store anything that looks like a name, and varchar for anything that looks like a code (currency codes, SWIFT codes, country codes, etc). Codes are often indexed and regularly used to join and aggregate things, so you’ll want to keep those small to manage your storage. Anything else, put it in an nvarchar and forget everything about encoding.
At least, that used to be my blanket recommendation until today. Though I’d argue that most people reading this post will be fine with this advice for many years to come.
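As a sketch of that rule of thumb – the table and column names here are made up purely for illustration:
CREATE TABLE dbo.Customers (
    CustomerID    int IDENTITY(1, 1) PRIMARY KEY,
    CountryCode   char(2) NOT NULL,       -- ISO country code: plain ASCII, often indexed and joined, so keep it small
    CurrencyCode  char(3) NOT NULL,       -- same story: "looks like a code"
    CustomerName  nvarchar(200) NOT NULL  -- "looks like a name": let nvarchar worry about the encoding
);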
Which brings us to the subject of collations
I remember it like yesterday. SQL Server 2000 was the version that broke a lot of my old web code with the introduction of “collations”. A collation order in SQL Server
- defines when characters are alike and when they are not – a collation can be
  - case-sensitive (“A” is different from “a”),
  - accent-sensitive (“á” is different from “a”),
  - kana-sensitive (“あ” is different from “ア”),
  - … and a few other switches
- sets other likeness rules – for example,
  - whether “ss” and “ß” are equivalent, or
  - whether “æ” and “ae” are equivalent
- determines the sort order of two strings
- defines a code page for 8-bit characters
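Here’s a quick sketch of what those sensitivity switches mean in practice – the comparisons below assume the Finnish_Swedish_100 collation family used later in this post, and that its usual CS/AI variants are installed:
-- CI = case-insensitive, CS = case-sensitive, AI = accent-insensitive, AS = accent-sensitive
SELECT CASE WHEN 'a' COLLATE Finnish_Swedish_100_CI_AS = 'A'  THEN 'equal' ELSE 'different' END AS case_CI,
       CASE WHEN 'a' COLLATE Finnish_Swedish_100_CS_AS = 'A'  THEN 'equal' ELSE 'different' END AS case_CS,
       CASE WHEN 'a' COLLATE Finnish_Swedish_100_CI_AI = N'á' THEN 'equal' ELSE 'different' END AS accent_AI,
       CASE WHEN 'a' COLLATE Finnish_Swedish_100_CI_AS = N'á' THEN 'equal' ELSE 'different' END AS accent_AS;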
When you think about it, collation orders affect a lot of things we do with string values in SQL Server – should we combine two values in a GROUP BY? How do we sort them with an ORDER BY? Are they equal or different? And if you try to compare or combine two strings with different collations, you’ll run into a collation conflict error, because you’re literally trying to compare apples to oranges.
In SQL Server, you can define the collation order on the server, the database, individual columns and parameter values, and even on a single expression using the COLLATE keyword.
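For example, here’s a minimal sketch of such a collation conflict and the explicit COLLATE that resolves it – the temp tables are made up, and Latin1_General_100_CS_AS just stands in for “some other collation”:
CREATE TABLE #a (val varchar(50) COLLATE Finnish_Swedish_100_CI_AS NULL);
CREATE TABLE #b (val varchar(50) COLLATE Latin1_General_100_CS_AS NULL);

-- This join fails with a "Cannot resolve the collation conflict ..." error
SELECT *
FROM #a
JOIN #b ON #a.val = #b.val;

-- Explicitly re-collating one side of the comparison makes it work
SELECT *
FROM #a
JOIN #b ON #a.val = #b.val COLLATE Finnish_Swedish_100_CI_AS;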
Introducing UTF-8 collations
Starting with SQL Server 2019, there is a set of new UTF-8 collations, meaning that you can apply a UTF-8 collation order to a varchar value – and store multi-byte Unicode characters in regular old varchar columns!
Let’s take it for a spin and see how UTF-8 compares to the old ways:
CREATE TABLE #helloworld (
Plain varchar(100) COLLATE Finnish_Swedish_100_CI_AS NULL,
nPlain nvarchar(100) COLLATE Finnish_Swedish_100_CI_AS NULL,
UTF varchar(100) COLLATE Finnish_Swedish_100_CI_AS_SC_UTF8 NULL,
nUTF nvarchar(100) COLLATE Finnish_Swedish_100_CI_AS_SC_UTF8 NULL
);
INSERT INTO #helloworld
VALUES (N'Hello world', N'Hello world', N'Hello world', N'Hello world'),
(N'Chào thế giới', N'Chào thế giới', N'Chào thế giới', N'Chào thế giới'),
(N'こんにちは世界', N'こんにちは世界', N'こんにちは世界', N'こんにちは世界'),
(N'👋🌍', N'👋🌍', N'👋🌍', N'👋🌍');
I’m prefixing string values in my code with an “N” to tell SQL Server that I’m using Unicode (nvarchar) strings. This is a pretty good practice to get into when you’re working with non-ASCII values. Without the N prefix, SQL Server will implicitly treat the literal as varchar, potentially losing valuable information.
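A small sketch of what that data loss looks like – assuming the database default collation is not a Japanese or UTF-8 one:
SELECT 'こんにちは'  AS without_prefix,  -- varchar literal: these characters fall outside the code page and come back as ?????
       N'こんにちは' AS with_prefix;     -- nvarchar literal: stored and returned as-is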
Here’s how the data comes out:
SELECT Plain, DATALENGTH(Plain) AS bytes, LEN(Plain) AS [len],
nPlain, DATALENGTH(nPlain) AS bytes, LEN(nPlain) AS [len],
UTF, DATALENGTH(UTF) AS bytes, LEN(UTF) AS [len],
nUTF, DATALENGTH(nUTF) AS bytes, LEN(nUTF) AS [len]
FROM #helloworld;
Characters that aren’t part of the plain varchar column’s code page just get converted to “?” characters, which is expected.
Note how nvarchar completely ignores the UTF-8 part of the collation. nvarchar is UTF-16. It still uses the defined collation order – string comparisons, sorting rules, etc – but nvarchar is always stored as UTF-16: two bytes per character, four for supplementary characters.
The all-ANSI phrase “Hello world” takes up a single byte per character in varchar and two in nvarchar, just like we expected.
The Vietnamese text contains some plain-ANSI characters as well as some Unicode characters. Plain varchar won’t work here. In nvarchar, it uses two bytes per character, but the UTF-8 varchar uses only about 1.4 bytes per character on average!
The Japanese characters (hiragana and kanji) use two bytes per character in nvarchar and three bytes in UTF-8. So that’s worse than using nvarchar.
Finally, the emoji (yes, Unicode literally went beyond all written languages) take four bytes each in nvarchar as well as in UTF-8. The storage space is 8 bytes in every column that can actually hold them, but the reported “length” varies depending on whether the collation understands supplementary characters.
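The length difference comes down to supplementary-character support in the collation. A small sketch, assuming the Finnish_Swedish_100 family also exists in an _SC variant:
-- Without supplementary-character support, LEN() sees the emoji as two UTF-16 code units
SELECT LEN(N'👋' COLLATE Finnish_Swedish_100_CI_AS)    AS len_without_sc,  -- 2
       LEN(N'👋' COLLATE Finnish_Swedish_100_CI_AS_SC) AS len_with_sc,     -- 1
       DATALENGTH(N'👋')                               AS bytes_utf16;     -- 4 either way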
Pros of UTF-8 collations
- UTF-8 collation can save you some space when the majority of your string values are plain ASCII, because it dynamically allocates space for the occasional longer character.
Cons of UTF-8 collations
- SQL Server will implicitly convert between varchar and nvarchar, but it won’t implicitly re-collate strings. If you’re going for a UTF-8 collation, you would be wise to use the same collation for the entire database, or spend time manually adding COLLATE keywords all over your codebase.
- You could run into the maximum length of the varchar field with UTF-8, even though you’re seemingly inserting a shorter string. If you have a currency field in a char(3) type with a UTF collation, and a malicious user enters TH฿:
DECLARE @t TABLE (
ccy char(3) COLLATE Finnish_Swedish_100_CI_AS_SC_UTF8
)
DECLARE @ccy nvarchar(10)=N'TH฿';
INSERT INTO @t
SELECT @ccy;
Msg 8152, Level 16, State 30, Line 7
String or binary data would be truncated.
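The reason: char(3) with a UTF-8 collation holds three bytes, not three characters, and the baht sign alone needs three bytes in UTF-8. A quick way to check the encoded size – a sketch reusing the collation from the example above:
-- 'T' and 'H' are 1 byte each in UTF-8, '฿' (U+0E3F) takes 3 bytes: 5 bytes in total
SELECT DATALENGTH(CAST(N'TH฿' COLLATE Finnish_Swedish_100_CI_AS_SC_UTF8 AS varchar(10))) AS utf8_bytes;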
Want more?
Erland Sommarskog has a 75-minute recorded session about collations, which is well worth watching.
I’ve also had fun with UTF8 this week – another team sent us a fixed-width format text file using UTF8, and it worked until the first Å
It’s almost as if any technology with the word “variable-something” in it is inherently designed to deliver random results. 🙂
Hello Daniel, very interesting article.
In your definition of UTF-8 you say, “32 bytes” (should be “32 bits”).
Also, some characters in UTF-16 do occupy 32 bits rather than 16 – though it would be correct to say that all the characters most people care about just occupy 16 bits. Many emoji require 32 bits in UTF-16.
When Unicode was first designed, it was assumed that no-one would need to define more than 2 to the power of 16 characters. But then along came a mixed bag of other things, such as ancient scripts (e.g. Linear-B and hieroglyphics), obscure Far Eastern characters, musical notations, and a whole lot of emoji. People realized that an extra 16-bit unit would sometimes be needed in UTF-16, and that UTF-8 could require up to 4 8-bit units. But one thing they did make sure of was that you could always tell _from the first unit_ how many more units would be following to make up the whole character.
This about the “length” is a tricky one. My understanding is that this is the number of visible characters in a string, irrespective of the number of bytes occupied by each character. But it’s not hard to think of ambiguities. Is “æ” one character or two? What about an e-acute? And “ß” is one character, yet “ss” is two. Does an emoji for a person become two characters when a skin tone modifier is added? And so on.
Thanks, Mark! I’ll update the post as soon as I have some time. 🙂
Hello Daniel. Nice article. Here are a few notes to help clarify some of the whacky nuances in Unicode and encodings in general:
1. A “character” also includes numbers, punctuation, control codes, various forms of white-space, accents, Dingbats, box drawing, etc. I would liken a “character” to an addressable entity, some of which require being used with other “characters”, but are still a thing unto themselves (e.g. combining characters).
2. “ASCII”, while commonly used to refer to many 8 bit code pages, is actually (sorry for being pedantic here) just a 7 bit encoding, encompassing code points 0 – 127. The 8-bit code pages in SQL Server are Windows code pages for the Windows collations, and Windows and DOS (and some EBCDIC, though that might just be for sorting) code pages for the SQL Server collations. For more info, please start here: https://en.wikipedia.org/wiki/Extended_ASCII
3. I wouldn’t say that there are 3 “flavours” of Unicode, rather there are three encodings (of the same Unicode character set).
4. UTF-8 uses 8-bit units (i.e. 1 byte). Characters use between one and four units / bytes.
5. UTF-16 uses 16-bit (i.e. 2 byte) units and is actually variable-length as characters use either one or two units (i.e. 2 or 4 bytes). The first 65,536 code points (U+0000 through U+FFFF) use a single 16-bit code unit, while the remaining code points (U+10000 and above, i.e. supplementary characters) use two 16-bit code units (i.e. surrogate pairs).
6. *VARCHAR* , when using non-UTF8 collations, stores characters using either 1 or 2 bytes. There are four double-byte character sets that are variable-length and use either 1 or 2 bytes to represent up to 20k (or so) characters.
7. *NVARCHAR* , as stated above, uses either 2 or 4 bytes per each character.
8. Unicode / NVARCHAR was introduced in SQL Server 7.0.
9. In addition to instance, database, and column levels, you can also specify collation per expression via the `COLLATE` keyword.
10. It should be *Unicode* not *UTF* “collation order” since UTF refers to encoding, and the various encodings of Unicode do not impact sorting or comparison.
11. Multi-byte characters could be stored in VARCHAR starting in SQL Server 2000 via those four double-byte character sets. SQL Server 2019 introduced the ability to store Unicode in VARCHAR.
12. While true that prefixing string literals with an upper-case “N” does indicate that literal to be UTF-16, it should also be noted that if the database in which code is being executed has a UTF-8 default collation, then string literals can contain Unicode characters without needing the “N” prefix (in that case, string literals with the “N” prefix are still UTF-16, and string literals without the “N” prefix are UTF-8; both are Unicode).
13. Regarding “non-ASCII characters in VARCHAR fields converting to ?”, it would be more accurate to say characters not present in the 8-bit code page associated with that VARCHAR field (or literal or variable). For example, the Hebrew character “ש” is valid VARCHAR data when using a Hebrew collation, but will convert to “?” when using a non-Hebrew, non-UTF8 collation.
14. Again, NVARCHAR characters are not always 2 bytes. Some are 4 bytes (in fact, the vast majority of code points require 4 bytes, it’s just that the most commonly used only require 2 bytes in UTF-16).
15. The Japanese string would also take 2 bytes per character, as VARCHAR, if using a Japanese collation (e.g. `Japanese_XJIS_140_CI_AS`). But yes, Unicode is still better.
16. The emojis, as stated above and as shown in your output, use 4 bytes each, regardless of encoding (hence two of them report being 8 bytes in the columns that can contain them). The reason for the variation in what’s reported by the LEN() function has to do with whether or not the collation supports Supplementary Characters (i.e. code points above U+FFFF). Support is denoted either by “_SC” or “_140” in the collation name. Exceptions are: _including_ `Latin1_General_100_BIN2_UTF8` (for VARCHAR data), which doesn’t match the name pattern, and _excluding_ `Japanese_XJIS_140_BIN[2]`, which does match the name pattern. Built-in functions, when using collations that do not support supplementary characters, see both surrogate characters of the surrogate pair (which is how supplementary characters are represented in UTF-16) rather than the single supplementary character.
Take care, Solomon…..
Thank you, Solomon! I appreciate your corrections/additions.