How does a file with Chinese characters know how many bytes to use per character?

误落风尘 2020-12-13 05:05

I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but still don'

9 answers

醉话见心 (original poster)

2020-12-13 05:37
In UTF-8, code points up to 0x7f are stored as 1 byte; up to 0x7ff as 2 bytes; up to 0xffff as 3 bytes; everything else as 4 bytes. (Technically, the 4-byte form can hold values up to 0x1fffff, but the highest code point allowed in Unicode is 0x10ffff.)
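
As a rough illustration of those thresholds, here is a minimal Python sketch. The function name `utf8_length` is made up for this example, and the sketch assumes the file is encoded as UTF-8 (other encodings use different rules). It cross-checks its answer against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one code point (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:
        return 1  # plain ASCII
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3  # most Chinese characters fall in this range
    return 4


# Compare with the built-in encoder for a few sample characters:
# 'A', e-acute, the Chinese character U+4E2D, and an emoji.
for ch in "A\u00e9\u4e2d\U0001F600":
    encoded = ch.encode("utf-8")
    assert utf8_length(ord(ch)) == len(encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

The sketch does not treat surrogate code points (0xd800 to 0xdfff) specially; a strict encoder such as Python's refuses to encode those.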

When decoding, the first byte of the multi-byte sequence is used to determine the number of bytes used to make the sequence:
1. 110x xxxx => 2-byte sequence
2. 1110 xxxx => 3-byte sequence
3. 1111 0xxx => 4-byte sequence
All subsequent bytes in the sequence must fit the 10xx xxxx pattern.
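
The decoding rule above can be sketched in Python roughly as follows. The names `sequence_length` and `split_characters` are invented for this example, and the code assumes the input is UTF-8. A first byte of the form 0xxx xxxx is a single-byte (ASCII) character, which is added here alongside the three multi-byte patterns.

```python
from typing import List


def sequence_length(first_byte: int) -> int:
    """Read the sequence length from the bit pattern of the first byte."""
    if (first_byte & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxx xxxx: single byte
    if (first_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110x xxxx
    if (first_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110 xxxx
    if (first_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 1111 0xxx
    raise ValueError(f"0x{first_byte:02x} cannot start a sequence")


def split_characters(data: bytes) -> List[bytes]:
    """Split a UTF-8 byte string into one byte chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the first must match 10xx xxxx.
        if len(chunk) != n or any((b & 0b1100_0000) != 0b1000_0000 for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        chunks.append(chunk)
        i += n
    return chunks


print(split_characters("a\u4e2d".encode("utf-8")))
```

This is only a sketch of the length-detection step: it does not reject overlong encodings or values above 0x10ffff, which a full decoder would also check.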