UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

别跟我提以往 2021-01-30 05:22

I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyo

5 Answers

天涯浪人 (original poster)

2021-01-30 06:13
A character encoding maps each symbol in a given character set to a sequence of code values, and ultimately to bytes. Please see this good article on Wikipedia on character encoding.

UTF-8 uses 1 to 4 bytes for each symbol (code point). Wikipedia gives a good rundown of how the multi-byte encoding works:
- The most significant bit of a single-byte character is always 0.
- The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences, 1110 for three-byte sequences, and 11110 for four-byte sequences.
- The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
- A UTF-8 stream contains neither the byte FE nor FF. This makes sure that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (the byte-order mark).
The page also shows you a great comparison between the advantages and disadvantages of each character encoding type.
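
To make the bit patterns listed above concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name encode_utf8 is made up for this illustration, and the choice to return an empty string for invalid input is just one possible convention.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Returns an empty string for values that are not encodable
// (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}
```

For example, encode_utf8(0x20AC) (the euro sign) yields the three bytes E2 82 AC, while encode_utf8(0x41) yields the single byte 41, the same as ASCII.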

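UTF-16 (the successor to UCS-2)

Uses 2 or 4 bytes for each symbol. Code points up to U+FFFF take one 16-bit unit; everything above that takes a pair of 16-bit units (a surrogate pair). UCS-2 is the older fixed-width form that only covers code points up to U+FFFF, so the two are not the same thing.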
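
As a rough sketch of why UTF-16 needs either 2 or 4 bytes, the function below encodes one code point into 16-bit units. The name encode_utf16 is hypothetical, and returning an empty string for invalid input is an assumption made for this example.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Returns an empty string for surrogates or values above U+10FFFF.
std::u16string encode_utf16(std::uint32_t cp) {
    std::u16string out;
    if (cp > 0x10FFFF) return out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;
    if (cp <= 0xFFFF) {
        // Basic Multilingual Plane: a single 16-bit unit (2 bytes).
        out += static_cast<char16_t>(cp);
    } else {
        // Supplementary planes: split the remaining 20 bits into two halves.
        std::uint32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate, top 10 bits
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate, bottom 10 bits
    }
    return out;
}
```

With this sketch, U+20AC fits in one unit, while U+1F600 becomes the surrogate pair D83D DE00.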

UTF-32 (UCS-4)

Always uses 4 bytes for each symbol.

In C and C++, char just means a byte of data; it is not an encoding. It is not analogous to UTF-8, UTF-16 or ASCII. A char* pointer can refer to any type of data in any encoding.

STL:

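Neither std::string nor std::wstring is designed for variable-length character encodings like UTF-8 and UTF-16. Both store and index code units (char or wchar_t), so size() and operator[] count code units, not characters.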
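
To illustrate the gap between bytes and characters in a std::string, here is a small sketch that counts code points in UTF-8 text. The helper name count_code_points is invented for this example, and it assumes the input is already well-formed UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Count code points in a string assumed to hold well-formed UTF-8.
// Every byte that is NOT a continuation byte (10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the string "a\xC3\xA9\xE2\x82\xAC" (the letter a, é, and the euro sign), size() reports 6 bytes while count_code_points reports 3. Note that a code point is still not always what a user perceives as one character, since combining marks are separate code points.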

How to implement:

Take a look at the iconv library. iconv is a powerful character encoding conversion library used by such projects as libxml (the XML C parser of the GNOME project).
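
As a hedged sketch of what using iconv can look like, the function below converts UTF-8 to UTF-16LE with the POSIX calls iconv_open, iconv and iconv_close. The wrapper name utf8_to_utf16le and the error handling are assumptions for this example. It is written against the glibc signature, where the input buffer is a char**; some platforms declare it as const char** or need linking with -liconv.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Convert UTF-8 text to UTF-16LE bytes (no byte-order mark) using iconv.
std::string utf8_to_utf16le(const std::string& in) {
    // Note the argument order: target encoding first, source encoding second.
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::string src = in;                       // iconv takes a non-const buffer
    std::string out(in.size() * 2 + 4, '\0');   // UTF-16 needs at most 2 bytes per UTF-8 byte

    char* src_ptr = &src[0];
    std::size_t src_left = src.size();
    char* dst_ptr = &out[0];
    std::size_t dst_left = out.size();

    // iconv advances both pointers and decrements both counters as it converts.
    std::size_t rc = iconv(cd, &src_ptr, &src_left, &dst_ptr, &dst_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - dst_left);          // keep only the bytes actually written
    return out;
}
```

A real program would typically loop on E2BIG with a growing output buffer rather than sizing it up front; the fixed bound here only works because the source and target encodings are known.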

Other great resources on character encoding:
- tbray.org's Characters vs. Bytes
- IANA character sets
- www.cs.tut.fi's A tutorial on character code issues
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (first mentioned by @Dylan Beattie)