Question
Following a related question, I'd like to ask about the new character and string literal types in C++11. It seems that we now have four sorts of characters and five sorts of string literals. The character types:
char a = '\x30'; // character, no semantics
wchar_t b = L'\xFFEF'; // wide character, no semantics
char16_t c = u'\u00F6'; // 16-bit, assumed UTF16?
char32_t d = U'\U0010FFFF'; // 32-bit, assumed UCS-4
And the string literals:
char A[] = "Hello\x0A"; // byte string, "narrow encoding"
wchar_t B[] = L"Hell\xF6\x0A"; // wide string, impl-def'd encoding
char16_t C[] = u"Hell\u00F6"; // (1)
char32_t D[] = U"Hell\U000000F6\U0010FFFF"; // (2)
char E[] = u8"\u00F6\U0010FFFF"; // (3)
The question is this: Are the \x/\u/\U character references freely combinable with all string types? Are all the string types fixed-width, i.e. the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes? Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence? And similarly for u8? In (1), can I write lone surrogates with \u? Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?
This is a bit of an open-ended question, but I'd like to get as complete a picture as possible of the new UTF-encoding and type facilities of the new C++11.
Answer 1:
Are the \x/\u/\U character references freely combinable with all string types?
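Almost. \x can be used in anything and always produces a single code unit with the given value. \u and \U are accepted in every kind of literal as well, but only in strings that are specifically UTF-encoded (u8"", u"", U"") is the resulting encoding guaranteed; in narrow and wide literals the code point is converted to the implementation-defined execution character set. For any UTF-encoded string, \u and \U can be used as you see fit.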
Are all the string types fixed-width, i.e. the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes?
Not in the way you mean. \u and \U are converted based on the string encoding, while \x always inserts exactly one code unit. The number of "code units" (in Unicode terms; a char16_t is a UTF-16 code unit) that a code point produces depends on the encoding of the containing string. The literal u8"\u1024" would create a string containing 3 chars plus a null terminator. The literal u"\u1024" would create a string containing 1 char16_t plus a null terminator.
The number of code units used is based on the Unicode encoding.
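As a rough check of the counts above, the sketch below measures the array length of the same code point in each UTF literal at compile time. The helper name code_units is made up for this illustration and is not a standard facility.

```cpp
#include <cstddef>

// Hypothetical helper: number of array elements in a string literal,
// excluding the null terminator.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+1024 lies in the range U+0800..U+FFFF, so it needs three UTF-8 code
// units but only one UTF-16 or UTF-32 code unit.
static_assert(code_units(u8"\u1024") == 3, "UTF-8: three code units");
static_assert(code_units(u"\u1024") == 1, "UTF-16: one code unit");
static_assert(code_units(U"\u1024") == 1, "UTF-32: one code unit");
```

If the file compiles, the counts hold; no runtime check is needed.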
Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence?
Yes. u"" creates a UTF-16 encoded string. u8"" creates a UTF-8 encoded string. They will be encoded per the Unicode specification, so u"\U0010FFFF" yields a two-unit surrogate pair.
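To make the encoding semantics concrete, here is a small sketch using the code point from the question. The variable name max_cp is chosen only for this example; the expected surrogate values follow from the UTF-16 encoding rules (0x10FFFF - 0x10000 = 0xFFFFF, split into two 10-bit halves).

```cpp
// A non-BMP code point in a UTF-16 literal becomes a surrogate pair.
constexpr char16_t max_cp[] = u"\U0010FFFF";

static_assert(sizeof(max_cp) / sizeof(max_cp[0]) == 3,
              "two code units plus the null terminator");
static_assert(max_cp[0] == 0xDBFF, "high (lead) surrogate");
static_assert(max_cp[1] == 0xDFFF, "low (trail) surrogate");
static_assert(max_cp[2] == 0, "null terminator");

// The same code point in a UTF-8 literal takes four code units
// (F4 8F BF BF) plus the null terminator.
static_assert(sizeof(u8"\U0010FFFF") == 5, "four code units plus terminator");
```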
In (1), can I write lone surrogates with \u?
Absolutely not. The specification expressly forbids using the UTF-16 surrogate code points (0xD800-0xDFFF) as codepoints for \u or \U.
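For contrast, a \x escape is not a code point reference: it writes one raw code unit and is not checked against the encoding. The sketch below assumes a compiler that accepts any \x value fitting in char16_t (which, as far as I know, the mainstream ones do); the name lone is made up for this example.

```cpp
// Naming 0xD800 through a universal-character-name is rejected by the
// compiler. A hex escape, by contrast, inserts the raw code unit as-is.
constexpr char16_t lone[] = u"\xD800";

static_assert(sizeof(lone) / sizeof(lone[0]) == 2,
              "one code unit plus the null terminator");
static_assert(lone[0] == 0xD800,
              "an unpaired high surrogate: not well-formed UTF-16");
```

So the literal prefix only guarantees the encoding of what \u, \U, and ordinary characters produce; \x can still put ill-formed sequences into a u"" or u8"" string.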
Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?
Absolutely not. Well, allow me to rephrase that.
std::basic_string doesn't deal with Unicode encodings. Its specializations can certainly store UTF-encoded strings, but they can only think of them as sequences of char, char16_t, or char32_t; they can't think of them as a sequence of Unicode codepoints that are encoded with a particular mechanism. basic_string::length() will return the number of code units, not code points. And obviously, the C standard library string functions are totally useless here.
It should be noted, however, that the number of codepoints is not the "length" of a Unicode string in any user-visible sense either. Some code points are combining "characters" (an unfortunate name), which combine with the previous codepoint. So multiple codepoints can map to a single visual character.
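The three notions of length (code units, code points, visual characters) can be told apart with a short sketch. count_code_points is a hypothetical helper written for this illustration, not a standard library function, and it assumes well-formed UTF-16 input without validating it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string by skipping
// low (trail) surrogates. Assumes well-formed input; does no validation.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++n;  // not a trail surrogate, so it starts a new code point
        }
    }
    return n;
}

inline void check_lengths() {
    // One code point outside the BMP: two code units.
    const std::u16string astral = u"\U0010FFFF";
    assert(astral.length() == 2);            // length() counts code units
    assert(count_code_points(astral) == 1);  // but there is one code point

    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points
    // that are normally rendered as a single visual character.
    const std::u16string accented = u"e\u0301";
    assert(accented.length() == 2);
    assert(count_code_points(accented) == 2);
}
```

Counting visual characters (grapheme clusters) needs the Unicode segmentation data, which the C++11 standard library does not provide, so the sketch stops at code points.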
Iostreams can in fact read/write Unicode-encoded values. To do so, you will have to use a locale to specify the encoding and properly imbue it into the various places. This is easier said than done, and I don't have any code on me to show you how.
Source: https://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11