How does a file with Chinese characters know how many bytes to use per character?

误落风尘 2020-12-13 05:05

I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but still don'

9 answers
  •  醉话见心
    2020-12-13 05:37

    In UTF-8, code points up to 0x7f are stored as 1 byte; up to 0x7ff as 2 bytes; up to 0xffff as 3 bytes; everything else as 4 bytes. (Technically, a 4-byte sequence can hold values up to 0x1fffff, but the highest code point allowed in Unicode is 0x10ffff.)
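
The ranges above can be checked with a few lines of Python. This is only a sketch: `utf8_length` is a hypothetical helper written for illustration, and it assumes the file is encoded as UTF-8. It also includes the 1-byte case for ASCII (code points up to 0x7f). The loop compares the helper against Python's built-in encoder.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 needs for one code point (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:      # ASCII
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:    # includes most Chinese characters
        return 3
    return 4                    # everything up to 0x10FFFF


# Compare the helper with the built-in UTF-8 encoder.
for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    print(hex(ord(ch)), utf8_length(ord(ch)), len(ch.encode("utf-8")))
```

Most common Chinese characters sit between 0x800 and 0xffff, which is why they usually take 3 bytes each in a UTF-8 file.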

    When decoding, the first byte of the multi-byte sequence is used to determine the number of bytes used to make the sequence:

    1. 110x xxxx => 2-byte sequence
    2. 1110 xxxx => 3-byte sequence
    3. 1111 0xxx => 4-byte sequence

    All subsequent bytes in the sequence must fit the 10xx xxxx pattern.
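
As a rough illustration of the decoding rule, the sketch below reads the lead byte to find the sequence length, then checks that each following byte matches 10xx xxxx. The names `sequence_length` and `decode_code_points` are hypothetical. The code is deliberately minimal: it does not reject overlong encodings or surrogate code points the way a strict decoder would.

```python
def sequence_length(lead):
    """Return the sequence length announced by a UTF-8 lead byte."""
    if (lead & 0b1000_0000) == 0b0000_0000:   # 0xxx xxxx: single byte (ASCII)
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:   # 110x xxxx
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:   # 1110 xxxx
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:   # 1111 0xxx
        return 4
    raise ValueError("not a valid lead byte: %#x" % lead)


def decode_code_points(data):
    """Decode UTF-8 bytes into a list of code points (no overlong/surrogate checks)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = sequence_length(lead)
        if i + n > len(data):
            raise ValueError("truncated sequence")
        if n == 1:
            cp = lead
        else:
            cp = lead & (0xFF >> (n + 1))      # payload bits of the lead byte
            for b in data[i + 1:i + n]:
                if (b & 0b1100_0000) != 0b1000_0000:   # must be 10xx xxxx
                    raise ValueError("bad continuation byte: %#x" % b)
                cp = (cp << 6) | (b & 0b0011_1111)     # append 6 payload bits
        code_points.append(cp)
        i += n
    return code_points


text = "\u4e2d\u6587"                      # two Chinese characters
raw = text.encode("utf-8")                 # 3 bytes each
print([hex(cp) for cp in decode_code_points(raw)])
```

So the reader never needs a separate length table: each character carries its own byte count in the top bits of its first byte.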
