I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain the basics of what I need to know?
A character encoding consists of a set of codes, each of which looks up a symbol in a given character set. See the good Wikipedia article on character encoding.
UTF-8 uses 1 to 4 bytes for each symbol. Wikipedia gives a good rundown of how the multi-byte scheme works (a small code sketch follows the list below):
- The most significant bit of a single-byte character is always 0.
- The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence: 110 for two-byte sequences, 1110 for three-byte sequences, and so on.
- The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
- A UTF-8 stream contains neither the byte 0xFE nor 0xFF, which ensures that a UTF-8 stream is never mistaken for a UTF-16 stream starting with the byte-order mark U+FEFF.
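As a rough sketch of how those rules translate into code (the byte values and the helper name are just for illustration, not part of any library), the high bits of a byte are enough to classify it:

```cpp
#include <cstdio>

// Classify a byte according to the UTF-8 rules listed above.
// Returns the length of the sequence this byte starts, 0 for a
// continuation byte (10xxxxxx), or -1 for a byte that cannot appear
// in valid UTF-8 at all (e.g. 0xFE, 0xFF).
static int utf8_sequence_length(unsigned char b) {
    if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx: single-byte (ASCII)
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx: start of a 2-byte sequence
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx: start of a 3-byte sequence
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx: start of a 4-byte sequence
    if ((b & 0xC0) == 0x80) return 0;  // 10xxxxxx: continuation byte
    return -1;                         // 0xFE, 0xFF and other invalid lead bytes
}

int main() {
    const unsigned char sample[] = { 0x41, 0xC3, 0xA9, 0xE2, 0x82, 0xAC, 0xFF };
    for (unsigned char b : sample)
        std::printf("0x%02X -> %d\n", static_cast<unsigned>(b), utf8_sequence_length(b));
}
```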
The Wikipedia page also gives a good comparison of the advantages and disadvantages of each character encoding.
UTF-16 (UCS-2)
Uses 2 or 4 bytes for each symbol. (Strictly speaking, UCS-2 is the older form that always uses 2 bytes; UTF-16 extends it with 4-byte surrogate pairs for symbols outside the Basic Multilingual Plane.)
UTF-32 (UCS-4)
Always uses 4 bytes for each symbol.
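To make the byte counts concrete, here is a small sketch, assuming a C++11 through C++17 compiler (in C++20 the u8 literal produces char8_t rather than char), encoding the single code point U+1F600 in each form:

```cpp
#include <cstdio>

int main() {
    // U+1F600 (an emoji outside the BMP) in each encoding, via C++11 literals.
    // The arrays include the terminating NUL, so one element is subtracted below.
    const char     utf8[]  = u8"\U0001F600"; // 4 one-byte code units
    const char16_t utf16[] = u"\U0001F600";  // 2 two-byte code units (a surrogate pair)
    const char32_t utf32[] = U"\U0001F600";  // 1 four-byte code unit

    std::printf("UTF-8 : %zu code units, %zu bytes\n",
                sizeof(utf8) / sizeof(utf8[0]) - 1, sizeof(utf8) - sizeof(utf8[0]));
    std::printf("UTF-16: %zu code units, %zu bytes\n",
                sizeof(utf16) / sizeof(utf16[0]) - 1, sizeof(utf16) - sizeof(utf16[0]));
    std::printf("UTF-32: %zu code units, %zu bytes\n",
                sizeof(utf32) / sizeof(utf32[0]) - 1, sizeof(utf32) - sizeof(utf32[0]));
}
```

For this particular symbol all three encodings happen to need 4 bytes; a plain ASCII character, by contrast, would take 1, 2 and 4 bytes respectively, which is where the space trade-offs between the encodings come from.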
char just means a byte of data and is not an encoding in itself, so it is not comparable to UTF-8/UTF-16/ASCII. A char* can point to data in any encoding; the interpretation is entirely up to the code that reads it.
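A short sketch of that point: the two bytes below happen to be the UTF-8 encoding of "é", but nothing about the char* records that, and strlen simply counts bytes. Read as Latin-1, the very same bytes would mean the two characters "Ã©".

```cpp
#include <cstdio>
#include <cstring>

int main() {
    // Two bytes that are the UTF-8 encoding of "é" (U+00E9).
    // The char* carries no encoding information at all.
    const char bytes[] = "\xC3\xA9";
    std::printf("strlen: %zu\n", std::strlen(bytes)); // prints 2: bytes, not characters
}
```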
STL:
Neither std::string nor std::wstring is designed for variable-length character encodings like UTF-8 and UTF-16: size(), indexing and iteration all work on fixed-size code units, not on characters.
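A minimal sketch of why that matters, assuming the std::string holds valid UTF-8 (utf8_length is a hypothetical helper, not a standard function): size() reports bytes, so counting characters means skipping the 10xxxxxx continuation bytes yourself.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// A minimal sketch: assumes the input is already valid UTF-8.
static std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) // not a continuation byte => start of a code point
            ++count;
    return count;
}

int main() {
    const std::string s = "\xC3\xA9t\xC3\xA9"; // "été" encoded as UTF-8
    std::printf("size():      %zu bytes\n", s.size());      // 5
    std::printf("code points: %zu\n", utf8_length(s));      // 3
}
```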
How to implement:
Take a look at the iconv library. iconv is a powerful character-encoding conversion library used by projects such as libxml (the XML C parser of GNOME).
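A minimal sketch of converting UTF-8 to UTF-16LE with iconv, assuming a POSIX iconv (on some platforms the second parameter of iconv() is declared const char**, which needs a cast, and you may have to link with -liconv). Real code should loop and grow the output buffer whenever iconv reports E2BIG.

```cpp
#include <iconv.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // Open a conversion descriptor: iconv_open(tocode, fromcode).
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

    char in[] = "\xC3\xA9t\xC3\xA9";          // "été" as UTF-8
    char* inp = in;
    size_t inleft = std::strlen(in);

    std::vector<char> out(inleft * 4);        // generously sized output buffer
    char* outp = out.data();
    size_t outleft = out.size();

    // iconv advances the pointers and decrements the byte counters as it converts.
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        std::perror("iconv");
        iconv_close(cd);
        return 1;
    }

    size_t produced = out.size() - outleft;
    std::printf("converted %zu UTF-8 bytes into %zu UTF-16 bytes\n",
                std::strlen(in), produced);
    iconv_close(cd);
    return 0;
}
```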
Other great resources on character encoding: