UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

别跟我提以往 2021-01-30 05:22

I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyo

5 answers
  •  天涯浪人
    2021-01-30 06:13

    A character encoding maps each symbol in a given character set to a sequence of one or more code values (bytes, in the encodings below). See the Wikipedia article on character encoding for background.

    UTF-8 uses 1 to 4 bytes for each symbol (each code point of Unicode, whose character set is also known as UCS). Wikipedia gives a good rundown of how the multi-byte encoding works:

    • The most significant bit of a single-byte character is always 0.
    • The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences, 1110 for three-byte sequences, and 11110 for four-byte sequences.
    • The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
    • A UTF-8 stream contains neither the byte FE nor FF. This makes sure that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (Byte-order mark)
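
    The bit patterns in the list above can be sketched in C++ roughly as follows. This is a minimal illustration, not a hardened codec: encode_utf8 and utf8_sequence_length are hypothetical helper names, and the length check only looks at the leading bits, so it does not reject every byte that the standard forbids (for example the overlong lead bytes C0 and C1).

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 using the bit patterns from the list.
// Returns an empty string for values that are not encodable
// (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogates are not symbols
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}

// Sequence length implied by the most significant bits of the first byte.
// Returns 0 for a continuation byte (10xxxxxx) or a byte such as FE or FF
// that matches none of the lead-byte patterns.
int utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}
```

    For instance, encode_utf8(0x20AC) (the euro sign) yields the three bytes E2 82 AC, and utf8_sequence_length(0xE2) reports 3.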

    The page also shows you a great comparison between the advantages and disadvantages of each character encoding type.

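    UTF-16 (successor to UCS-2)

    Uses either 2 or 4 bytes for each symbol. Symbols up to U+FFFF take one 16-bit unit; anything above that is written as a surrogate pair of two 16-bit units. UCS-2 is the older fixed-width form: it has no surrogate pairs, so it cannot represent symbols above U+FFFF.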
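
    As a rough sketch of where the "2 or 4 bytes" comes from: symbols up to U+FFFF fit in one 16-bit unit, and anything above is split into a pair of units called surrogates. The helper name encode_utf16 is hypothetical, and the code assumes a C++11 compiler for std::u16string.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Returns an empty string for surrogate values or anything above U+10FFFF.
std::u16string encode_utf16(std::uint32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;        // reserved for surrogate pairs
        out += static_cast<char16_t>(cp);                    // one unit = 2 bytes
    } else if (cp <= 0x10FFFF) {
        std::uint32_t v = cp - 0x10000;                      // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate: top 10 bits
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate: bottom 10 bits
    }
    return out;
}
```

    With this sketch, U+20AC comes out as the single unit 20AC, while U+1F600 comes out as the pair D83D DE00.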

    UTF-32 (UCS-4)

    Always uses 4 bytes for each symbol.

    char just means a byte of data and is not an actual encoding. It is not analogous to UTF-8/UTF-16/ASCII. A char* pointer can refer to any type of data and any encoding.
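
    To make that concrete, here is a small sketch in which the caller has to know, from outside the data, that a char* holds ISO-8859-1 (Latin-1) text before it can be re-encoded as UTF-8. Nothing in the pointer says so. latin1_to_utf8 is a hypothetical helper; it relies on the fact that Latin-1 byte values are the same numbers as the Unicode code points U+0000 to U+00FF.

```cpp
#include <string>

// Re-encode a NUL-terminated byte string that is KNOWN to be Latin-1 as UTF-8.
// If the bytes were in some other encoding, the result would be garbage:
// the char* itself carries no information about which encoding it holds.
std::string latin1_to_utf8(const char* bytes) {
    std::string out;
    for (const char* p = bytes; *p != '\0'; ++p) {
        unsigned char b = static_cast<unsigned char>(*p);
        if (b < 0x80) {
            out += static_cast<char>(b);                  // ASCII range: same single byte
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));    // 110xxxxx
            out += static_cast<char>(0x80 | (b & 0x3F));  // 10xxxxxx
        }
    }
    return out;
}
```

    The four Latin-1 bytes 63 61 66 E9 ("café") become the five UTF-8 bytes 63 61 66 C3 A9.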

    STL:

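    Neither of the STL's string types, std::string or std::wstring, is designed for variable-length character encodings like UTF-8 and UTF-16. They store and index fixed-size code units (char or wchar_t), so size() and operator[] count code units, not symbols.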
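
    A quick way to see the mismatch: std::string::size() reports bytes, so counting symbols in UTF-8 text needs a separate pass. The sketch below uses a hypothetical helper, count_code_points, and assumes the input is already well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Count code points in a string ASSUMED to hold well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;  // lead byte or single-byte character
    }
    return n;
}
```

    For the euro sign stored as the bytes E2 82 AC, size() is 3 while count_code_points gives 1.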

    How to implement:

    Take a look at the iconv library. iconv is a powerful character encoding conversion library used by such projects as libxml (the GNOME XML parser for C).
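
    A minimal sketch of calling iconv from C++ might look like this. It assumes a POSIX system whose iconv accepts the encoding names "UTF-8" and "UTF-16LE" and declares the input buffer as char** (glibc does; some platforms use const char** and may need linking with -liconv). utf8_to_utf16le is a hypothetical wrapper name, and the error handling is deliberately crude.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Convert UTF-8 text to raw UTF-16LE bytes (no byte-order mark expected).
std::string utf8_to_utf16le(const std::string& in) {
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");  // arguments are (to, from)
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    // Each UTF-8 byte expands to at most 2 bytes of UTF-16.
    std::string out(in.size() * 2, '\0');

    char* src = const_cast<char*>(in.data());      // iconv does not modify the input
    std::size_t src_left = in.size();
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    // iconv advances src/dst and decrements the remaining-byte counters.
    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) throw std::runtime_error("iconv failed");

    out.resize(out.size() - dst_left);             // keep only the bytes written
    return out;
}
```

    The result is held in a std::string purely as a byte container here, which fits the earlier point that char is a byte rather than an encoding.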

    Other great resources on character encoding:

    • tbray.org's Characters vs. Bytes
    • IANA character sets
    • www.cs.tut.fi's A tutorial on code issues
    • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (first mentioned by @Dylan Beattie)
