In the ¹comp.lang.c++ Usenet group I recently asserted, based on what I thought I knew, that Windows' 16-bit wchar_t, with UTF-16 encoding where sometimes two
Let's start from first principles:
(§3.7.3) wide character: bit representation that fits in an object of type wchar_t, capable of representing any character in the current locale
(§3.7) character: 〈abstract〉 member of a set of elements used for the organization, control, or representation of data
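That, right away, rules out full Unicode as a character set (a set of elements/characters) representable in a 16-bit wchar_t: a single 16-bit object has only 65,536 distinct values, far fewer than the number of Unicode code points.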
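The arithmetic behind that claim can be checked at compile time. This is only a sketch: the constant and function names are mine, and the U+10FFFF upper bound comes from the Unicode standard, not from the C standard quoted above.

```cpp
#include <cstdint>

// Unicode code points run from U+0000 to U+10FFFF (a fact about Unicode,
// not something the C standard states).
constexpr std::uint32_t kUnicodeCodePoints = 0x10FFFFu + 1u;  // 1,114,112

// Number of distinct values a 16-bit object can hold, as with a 16-bit wchar_t.
constexpr std::uint32_t k16BitValues = 1u << 16;              // 65,536

// Hypothetical helper: can this code point be stored in a single 16-bit unit?
constexpr bool fits_in_16_bits(std::uint32_t code_point) {
    return code_point < k16BitValues;
}

static_assert(kUnicodeCodePoints > k16BitValues,
              "the full Unicode code space does not fit in 16 bits");
static_assert(fits_in_16_bits(0x00E9), "U+00E9 lies in the BMP");
static_assert(!fits_in_16_bits(0x1F600), "U+1F600 lies outside the BMP");
```

If the file compiles, the assertions hold; nothing here depends on the actual width of wchar_t on the machine doing the compiling.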
But wait, Nicol Bolas quoted the following:
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.
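To see what that rule means in practice, here is a small sketch that counts the elements of a few literals. The helper literal_size is a name I made up for illustration. The comment about 16-bit compilers describes behavior I believe is typical (for example with MSVC), not something the quoted text guarantees.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in a string literal's array,
// including the terminating null element.
template <typename CharT, std::size_t N>
constexpr std::size_t literal_size(const CharT (&)[N]) {
    return N;
}

// 'a', one universal-character-name, and the terminating U'\0'.
static_assert(literal_size(U"a\U0001F600") == 3,
              "char32_t literal: one element per character plus terminator");

// Both characters are in the BMP, so one wchar_t each is what mainstream
// implementations produce.
static_assert(literal_size(L"a\u00e9") == 3,
              "wide literal with BMP-only characters");

// An astral character in a wide literal is the interesting case: the quoted
// rule says 2 (one character plus L'\0'), which is what a 32-bit wchar_t
// gives. Compilers with a 16-bit wchar_t typically emit a surrogate pair
// instead, giving 3, so no particular value is asserted here.
constexpr std::size_t kAstralWideSize = literal_size(L"\U0001F600");
```

Printing kAstralWideSize on your own toolchain shows which of the two behaviors you get.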
and then wondered about the behavior for characters outside the execution character set. Well, C99 has the following to say about this issue:
(§5.1.1.2) Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
and further clarifies in a footnote that an implementation need not convert all non-corresponding source characters to the same execution character.
Armed with this knowledge, you can declare that your wide execution character set is the Basic Multilingual Plane, and that you consider surrogates to be proper characters themselves, not mere surrogates for other characters. AFAICT, this means you are in the clear as far as Clause 6 (Language) of ISO C99 is concerned.
Of course, don't expect Clause 7 (Library) to play along nicely with you. As an example, consider iswalpha(wint_t). You cannot pass astral characters (characters outside the BMP) to that function; you can only pass it the two surrogates, one at a time. You'd get some nonsensical result, but that's fine because you declared the surrogates themselves to be proper members of the execution character set.
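A sketch of that situation, using U+1D400 (MATHEMATICAL BOLD CAPITAL A) as the astral letter. SurrogatePair, split_astral and both_halves_alpha are hypothetical names of mine; the surrogate arithmetic is the standard UTF-16 encoding. What iswalpha returns for a lone surrogate depends on the implementation and locale, so the code does not assume any particular answer.

```cpp
#include <cwctype>

// Hypothetical holder for the two UTF-16 code units of an astral character.
struct SurrogatePair {
    std::wint_t high;
    std::wint_t low;
};

// Standard UTF-16 arithmetic; only meaningful for code points >= 0x10000.
constexpr SurrogatePair split_astral(char32_t cp) {
    return SurrogatePair{
        static_cast<std::wint_t>(0xD800 + ((cp - 0x10000) >> 10)),
        static_cast<std::wint_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))
    };
}

// With a 16-bit wchar_t the library never sees U+1D400 itself. All you can
// do is ask about each half separately, and neither answer says anything
// about the letter the pair stands for.
inline bool both_halves_alpha(char32_t cp) {
    const SurrogatePair pair = split_astral(cp);
    return std::iswalpha(pair.high) != 0 && std::iswalpha(pair.low) != 0;
}

static_assert(split_astral(0x1D400).high == 0xD835, "high surrogate of U+1D400");
static_assert(split_astral(0x1D400).low == 0xDC00, "low surrogate of U+1D400");
```

Calling both_halves_alpha(0x1D400) is the closest you can get to asking "is this a letter?", and whatever it returns is an answer about two surrogate "characters", not about 𝐀.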