How do I use 3 and 4-byte Unicode characters with standard C++ strings?

Question

In standard C++ we have char and wchar_t for storing characters. char can store values between 0x00 and 0xFF. And wchar_t can store values between 0x0000 and 0xFFFF. std::string uses char, so it can store 1-byte characters only. std::wstring uses wchar_t, so it can store characters up to 2-byte width. This is what I know about strings in C++. Please correct me if I said anything wrong up to this point.

I read the article for UTF-8 in Wikipedia, and I learned that some Unicode characters consume up to 4-byte space. For example, the Chinese character 𤭢 has a Unicode code point 0x24B62, which consumes 3-byte space in the memory.

Is there an STL string container for dealing with this kind of character? I'm looking for something like std::string32. Also, we had main() for the ASCII entry point and wmain() for an entry point with 16-bit character support; what entry point do we use for code that supports 3 and 4-byte Unicode characters?

Can you please add a tiny example?

(My OS: Windows 7 x64)

Answer 1:

First you need a better understanding of Unicode. Specific answers to your questions are at the bottom.

Concepts

You need a more nuanced set of concepts than are required for very simple text handling as taught in introductory programming courses.

byte
code unit
code point
abstract character
user-perceived character

A byte is the smallest addressable unit of memory. Usually 8 bits today, capable of storing up to 256 different values. By definition a char is one byte.

A code unit is the smallest fixed size unit of data used in storing text. When you don't really care about the content of text and you just want to copy it somewhere or calculate how much memory the text is using then you care about code units. Otherwise code units aren't much use.

A code point represents a distinct member of a character set. Whatever 'characters' are in a character set, they all are assigned a unique number, and whenever you see a particular number encoded then you know which member of the character set you're dealing with.

An abstract character is an entity with meaning in a linguistic system, and is distinct from its representation or any code points assigned to that meaning.

User-perceived characters are what they sound like: what the user thinks of as a character in whatever linguistic system they are using.

In the old days, char represented all of these things: a char is by definition a byte, in char* strings the code units are chars, the character sets were small so the 256 values representable by char was plenty to represent every member, and the linguistic systems that were supported were simple, so the members of the character sets mostly represented the characters users wanted to use directly.

But this simple system with char representing pretty much everything wasn't enough to support more complex systems.

The first problem encountered was that some languages use far more than 256 characters. So 'wide' characters were introduced. Wide characters still used a single type to represent four of the above concepts: code units, code points, abstract characters, and user-perceived characters. However, wide characters are no longer single bytes. This was thought to be the simplest method of supporting large character sets.

Code could mostly be the same, except it would deal with wide characters instead of char.

However it turns out that many linguistic systems aren't that simple. In some systems it makes sense not to have every user-perceived character necessarily be represented by a single abstract character in the character set. As a result text using the Unicode character set sometimes represents user perceived characters using multiple abstract characters, or uses a single abstract character to represent multiple user-perceived characters.

Wide characters have another problem. Since they increase the size of the code unit they increase the space used for every character. If one wishes to deal with text that could adequately be represented by single byte code units, but must use a system of wide characters then the amount of memory used is higher than would be the case for single byte code units. As such, it was desired that wide characters not be too wide. At the same time wide characters need to be wide enough to provide a unique value for every member of the character set.

Unicode contains well over 100,000 abstract characters. This turns out to require wide characters that are wider than most people care to use. As a result, a system of wide characters, where code units larger than one byte are used to directly store code point values, turns out to be undesirable.

So to summarize, originally there was no need to distinguish between bytes, code units, code points, abstract characters, and user perceived characters. Over time, however, it became necessary to distinguish between each of these concepts.

Encodings

Prior to the above, text data was simple to store. Every user-perceived character corresponded to an abstract character, which had a code point value. There were few enough characters that 256 values was plenty. So one simply stored the code point numbers corresponding to the desired user-perceived characters directly as bytes. Later, with wide characters, the values corresponding to user-perceived characters were stored directly as integers of larger sizes, 16 bits, for example.

But since storing Unicode text this way would use more memory than people are willing to spend (three or four bytes for every character) Unicode 'encodings' store text not by storing the code point values directly, but by using a reversible function to compute some number of code unit values to store for each code point.

The UTF-8 encoding, for example, can take the most commonly used Unicode code points and represent them using a single one-byte code unit. Less common code points are stored using two one-byte code units. Code points that are still less common are stored using three or four code units.
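As a rough illustration of the variable-length scheme just described, the following sketch encodes a single code point into UTF-8 code units by hand. The name encode_utf8 is made up for this example and is not a standard library function; production code would more likely rely on an existing library.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the standard library): encode one Unicode
// code point as UTF-8. Returns an empty string for values that are not valid
// scalar values (surrogates, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // not encodable
    }
    if (cp < 0x80) {
        // 1 code unit: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 code units: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 code units: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 code units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encode_utf8(0x0400) produces the two bytes D0 80, and encode_utf8(0x24B62) produces the four bytes F0 A4 AD A2.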

This means that common text can generally be stored with the UTF-8 encoding using less memory than 16-bit wide character schemes, but also that the numbers stored do not necessarily correspond directly to the code point values of abstract characters. Instead if you need to know what abstract characters are stored, you have to 'decode' the stored code units. And if you need to know the user perceived characters you have to further convert abstract characters into user perceived characters.

There are many different encodings, and in order to convert data using those encodings into abstract characters you must know the right method of decoding. The stored values are effectively meaningless if you don't know what encoding was used to convert the code point values into code units.

An important implication of encodings is that you need to know whether particular manipulations of encoded data are valid, or meaningful.

For example, if you want to get the 'size' of a string, are you counting bytes, code units, abstract characters, or user-perceived characters? std::string::size() counts code units, and if you need a different count then you have to use another method.
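To make the difference between these counts concrete, here is a minimal sketch that counts code points in a std::string, assuming the string holds well-formed UTF-8. The helper name count_code_points is hypothetical. It does not count user-perceived characters, which would require grapheme segmentation from a Unicode library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a string that is assumed to be
// well-formed UTF-8. Every code point contributes exactly one byte that is
// not a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the string "a\xD0\x80\xF0\xA4\xAD\xA2" (the characters U+0061, U+0400 and U+24B62), size() returns 7 code units while count_code_points returns 3.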

As another example, if you split an encoded string you need to know if you're doing so in such a way that the result is still valid in that encoding and that the data's meaning hasn't unintentionally changed. For example you might split between code units that belong to the same code point, thus producing an invalid encoding. Or you might split between code points which must be combined to represent a user perceived character and thus produce data the user will see as incorrect.
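The first kind of bad split (cutting through a code point) can be avoided with a small check. The sketch below, using a hypothetical helper named floor_to_code_point, moves a byte index back to the start of the code point it falls in, assuming well-formed UTF-8. It does nothing about the second kind of split: keeping combining sequences together needs grapheme-cluster rules, which are out of scope here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: given a byte index into well-formed UTF-8, return the
// nearest index at or before it that starts a code point. Indexes at or past
// the end are clamped to the string's size.
std::size_t floor_to_code_point(const std::string& utf8, std::size_t pos) {
    if (pos >= utf8.size()) {
        return utf8.size();
    }
    // Step back over continuation bytes (bit pattern 10xxxxxx).
    while (pos > 0 &&
           (static_cast<unsigned char>(utf8[pos]) & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}
```

A caller could then truncate safely with something like s.substr(0, floor_to_code_point(s, limit)) instead of s.substr(0, limit).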

Answers

Today char and wchar_t can only be considered code units. The fact that char is only one byte doesn't prevent it from representing code points that take two, three, or four bytes: you simply use two, three, or four chars in sequence. This is how UTF-8 was intended to work. Likewise, platforms that use a two-byte wchar_t to represent UTF-16 simply use two wchar_t in a row when necessary. The actual values of char and wchar_t don't individually represent Unicode code points; they represent the code unit values that result from encoding the code points. For example, the Unicode code point U+0400 is encoded into two code units in UTF-8: 0xD0 0x80. The Unicode code point U+24B62 is similarly encoded as four code units: 0xF0 0xA4 0xAD 0xA2.

So you can use std::string to hold UTF-8 encoded data.

On Windows, main() supports not just ASCII but whatever the system char encoding is. Unfortunately Windows doesn't support UTF-8 as the system char encoding the way other platforms do, so you are limited to legacy encodings like cp1252 or whatever your system is configured to use. You can, however, use a Win32 API call to directly access the UTF-16 command line parameters instead of using main()'s argc and argv parameters. See GetCommandLineW() and CommandLineToArgvW().

wmain()'s argv parameter fully supports Unicode. The 16-bit code units stored in wchar_t on Windows are UTF-16 code units. The Windows API uses UTF-16 natively, so it's quite easy to work with on Windows. wmain() is non-standard though, so relying on this won't be portable.

Answer 2:

The size and meaning of wchar_t is implementation-defined. On Windows it's 16 bit as you say, on Unix-like systems it's often 32 bit but not always.

For that matter, a compiler is permitted to do its own thing and pick a different size for wchar_t than what the system says -- it just won't be ABI-compatible with the rest of the system.

C++11 provides std::u32string, which is for representing strings of Unicode code points. I believe that sufficiently recent Microsoft compilers include it. It's of somewhat limited use since Microsoft's system functions expect 16-bit wide characters (a.k.a. UTF-16LE), not 32-bit Unicode code points (a.k.a. UTF-32, UCS-4).

You mention UTF-8, though: UTF-8 encoded data can be stored in a regular std::string. Of course, since it's a variable-length encoding, you can't access Unicode code points by index; you can only access the bytes by index. But you'd normally write your code not to need to access code points by index anyway, even if using u32string. Unicode code points don't correspond 1-1 with printable characters ("graphemes") because of the existence of combining marks in Unicode, so many of the little tricks you play with strings when learning to program (reversing them, searching for substrings) don't work so easily with Unicode data no matter what you store it in.

The character 𤭢 is, as you say, U+24B62. It is UTF-8 encoded as a series of four bytes, not three: F0 A4 AD A2. Translating between UTF-8 encoded data and Unicode code points takes some effort (admittedly not a huge amount, and library functions will do it for you). It's best to regard "encoded data" and "Unicode data" as separate things. You can use whatever representation you find most convenient right up to the point where you need to (for example) render the text to screen. At that point you need to (re-)encode it to an encoding that your output destination understands.
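To give a feel for how much effort the translation involves, here is a hand-written sketch that decodes UTF-8 bytes into a std::u32string of code points. The function name decode_utf8 is invented for this example. It rejects bad lead bytes, bad continuation bytes, and truncated sequences, but as a simplification it does not reject overlong forms or encoded surrogates, which a strict decoder should.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 code units into UTF-32 code points.
// Simplified: overlong encodings and surrogate values are NOT rejected.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Decoding the four bytes F0 A4 AD A2 with this sketch yields a std::u32string of length 1 whose only element is 0x24B62.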

Answer 3:

Windows uses UTF-16. Any code point in the range of U+0000 to U+D7FF and U+E000 to U+FFFF will be stored directly; any outside of those ranges will be split into two 16-bit values according to the UTF-16 encoding rules.

For example, 0x24B62 will be encoded as 0xD852, 0xDF62.
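The UTF-16 encoding rule for code points above U+FFFF is simple arithmetic: subtract 0x10000, add the top ten bits of the result to 0xD800, and add the low ten bits to 0xDC00. A minimal sketch follows; encode_utf16 is a hypothetical name, and the function assumes its input is a valid scalar value (no validation is done).

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes cp is a valid Unicode scalar value (<= 0x10FFFF, not a surrogate).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: stored directly as one code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        char32_t v = cp - 0x10000;  // 20-bit value
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```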

You may convert the strings to work with them any way you'd like but the Windows API will still want and deliver UTF-16 so that's probably going to be the most convenient.

Answer 4:

In standard C++ we have char and wchar_t for storing characters. char can store values between 0x00 and 0xFF. And wchar_t can store values between 0x0000 and 0xFFFF.

Not quite:

sizeof(char)     == 1   so 1 byte per character.
sizeof(wchar_t)  == ?   Depends on your system 
                        (for unix usually 4 for Windows usually 2).
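Rather than assuming a width, code can ask the compiler. The following small sketch (with a made-up helper name, wchar_bits) reports how many bits a wchar_t has on the platform it is compiled for; the static_assert documents the one size that is fixed by the language.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: number of bits in one wchar_t on this platform.
// The result is implementation-defined (commonly 16 on Windows and 32 on
// many Unix-like systems, but neither is guaranteed).
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// sizeof(char) is 1 by definition on every conforming implementation.
static_assert(sizeof(char) == 1, "char is always exactly one byte");
```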

Unicode characters consume up to 4-byte space.

Not quite. Unicode is not an encoding. Unicode is a standard that defines what each code point is, and code points are restricted to 21 bits. The low 16 bits give the character's position within a code plane, while the upper 5 bits identify which plane the character is on.

There are several Unicode encodings (UTF-8, UTF-16 and UTF-32 being the most common); these determine how you store the characters in memory. There are practical differences between the three.
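The plane and the position within the plane can be pulled out of a code point with a shift and a mask. This is only an illustrative sketch; the struct and function names are invented for the example.

```cpp
#include <cstdint>

// Hypothetical type: a code point split into its plane number and its
// 16-bit position within that plane.
struct PlanePosition {
    std::uint32_t plane;   // 0 is the Basic Multilingual Plane
    std::uint32_t offset;  // 0x0000 to 0xFFFF
};

// Hypothetical helper: split a code point (assumed <= 0x10FFFF).
PlanePosition split_plane(std::uint32_t cp) {
    return PlanePosition{cp >> 16, cp & 0xFFFF};
}
```

For U+24B62 this gives plane 2 and offset 0x4B62.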

    UTF-8:   Great for storage and transport (as it is compact)
             Bad because it is variable length
    UTF-16:  Horrible in nearly all regards
             It is always large and it is variable length
             (anything not on the BMP needs to be encoded as surrogate pairs)
    UTF-32:  Great for in memory representations as it is fixed size
             Bad because it takes 4 bytes for each character which is usually overkill

Personally I use UTF-8 for transport and storage and UTF-32 for in memory representation of text.

Answer 5:

char and wchar_t are not the only data types used for text strings. C++11 introduces the new char16_t and char32_t data types, and the corresponding std::u16string and std::u32string typedefs of std::basic_string, to address the ambiguity of the wchar_t type, which has different sizes and encodings on different platforms. wchar_t is 16-bit on some platforms, suitable for UTF-16 encoding, but is 32-bit on other platforms, suitable for UTF-32 encoding instead. char16_t is specifically 16-bit and UTF-16, and char32_t is specifically 32-bit and UTF-32, on all platforms.
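As a tiny example of these types, the sketch below stores the single character U+24B62 in each of the three encoding forms. The variable names are made up for illustration. The UTF-8 bytes are written out explicitly so the snippet does not depend on the source file's encoding or on how a given language version types u8 literals.

```cpp
#include <string>

// The same single character, U+24B62, in three encoding forms.
// UTF-8 in std::string: four one-byte code units, spelled out by hand.
const std::string utf8_text = "\xF0\xA4\xAD\xA2";
// UTF-16 in std::u16string: the compiler emits a surrogate pair (two code units).
const std::u16string utf16_text = u"\U00024B62";
// UTF-32 in std::u32string: one code unit holding the code point itself.
const std::u32string utf32_text = U"\U00024B62";
```

Calling size() on these three strings returns 4, 2 and 1 respectively: the number of code units, not the number of characters.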

Source: https://stackoverflow.com/questions/12643580/how-do-i-use-3-and-4-byte-unicode-characters-with-standard-c-strings

Tags

c++

string

stl

stdstring

unicode-string