In the ¹comp.lang.c++ Usenet group I recently asserted, based on what I thought I knew, that Windows\' 16-bit wchar_t
, with UTF-16 encoding where sometimes two
After clarifying what the question is I will do an edit.
Q: Is the width of 16 bits for wchar_t in Windows conformant to the standard?
A: Well, lets see. We will start with the definition of wchar_t from c99 draft.
... largest extended character set specified among the supported locales.
So, we should look what are the supported locales. For that there are Three steps:
We quickly open the documentation for the locale string. We see the format of the string
locale :: "locale_name"
| "language[_country_region[.code_page]]"
| ".code_page"
| "C"
| ""
| NULL
We see the list of supported Code pages and we see UTF-8, UTF-16, UTF-32 and what not. We're in a dead end.
If we start with the C99 definition, it ends with
... corresponds to a member of the extended character set.
The word "character set" is used. But if we say UTF-16 code units are our character set, then all is OK. Otherwise, it's not. It's kinda vague, and one should not care much. The standards were defined many years ago, when Unicode was not popular.
At the end of the day, we now have C++11 and C11 that define use cases for UTF-8, 16 and 32 with the additional types char16_t and char32_t.
You need to read about Unicode and you will answer the question yourself.
Unicode is a character set. Set of characters, it's about 200000 characters. Or more precisely it is a mapping, mapping between numbers and characters. Unicode by itself does not mean this or that bit width.
Then there are 4 encodings, UTF-7, UTF-8, UTF-16 and UTF-32. UTF stands for Unicode transformation format. Each format defines a code point and a code unit. Code point is an actual charter from Unicode and can consists of one or more units. Only UTF-32 has one unit per point.
On the other hand, each unit fits into a fixed size integer. So UTF-7 units are at most 7 bits, UTF-16 units are at most 16 bits etc.
Therefore, in a 16 bit wchar_t string we can hold Unicode text encoded in UTF-16. Particularly in UTF-16 each point takes one or two units.
So the final answer, in a single wchar_t you can not store all Unicode char, only the single unit ones, but in a string of wchar_t you can store any Unicode text.