In standard C++ we have `char` and `wchar_t` for storing characters. A `char` can store values between 0x00 and 0xFF (assuming the usual 8-bit byte).
The size and meaning of `wchar_t` is implementation-defined. On Windows it's 16 bits, as you say; on Unix-like systems it's often 32 bits, but not always. For that matter, a compiler is permitted to do its own thing and pick a different size for `wchar_t` than what the system uses -- it just won't be ABI-compatible with the rest of the system.
C++11 provides `std::u32string`, which is for representing strings of Unicode code points. I believe that sufficiently recent Microsoft compilers include it. It's of somewhat limited use, though, since Microsoft's system functions expect 16-bit wide characters (a.k.a. UTF-16LE), not 32-bit Unicode code points (a.k.a. UTF-32 / UCS-4).
You mention UTF-8, though: UTF-8 encoded data can be stored in a regular `std::string`. Of course, since it's a variable-length encoding, you can't access Unicode code points by index; you can only access the bytes by index. But you'd normally write your code so it doesn't need to access code points by index anyway, even if using `u32string`. Unicode code points don't correspond 1:1 with printable characters ("graphemes") because of combining marks in Unicode, so many of the little tricks you play with strings when learning to program (reversing them, searching for substrings) don't work so easily with Unicode data, no matter what you store it in.
The character