If 'ℤ' is in the BMP, why isn't it encoded in 2 bytes?

Submitted by 扶醉桌前 on 2020-02-24 17:01:06

Question


My question arises from this answer, which says:

Since 'ℤ' (0x2124) is in the basic multilingual plane it is represented by a single code unit.

If that's correct, then why is "ℤ".getBytes(StandardCharsets.UTF_8).length == 3 and "ℤ".getBytes(StandardCharsets.UTF_16).length == 4?


Answer 1:


It seems you're mixing up two things: the character set (Unicode) and its encodings (UTF-8 or UTF-16).

0x2124 is only the 'sequence number' in the Unicode table. Unicode is nothing more than a bunch of 'sequence numbers' mapped to certain characters. Such a sequence number is called a code point, and it's often written down as a hexadecimal number.

How that number is encoded is a separate matter, and the encoded form can take up more bytes than the raw code point would.
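
As a rough illustration of the difference between a code point and its encoded size, here is a minimal Java sketch. The class name EncodingSizes and its helper methods are made up for this example; only the JDK calls (String.getBytes, StandardCharsets, codePointAt) are real. It also shows where the 4 bytes in the question come from: as far as the JDK behaviour goes, encoding with StandardCharsets.UTF_16 prepends a 2-byte byte-order mark, while UTF_16BE does not.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named only for this illustration.
class EncodingSizes {
    // Number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16 as produced by the JDK, which writes a byte-order mark first.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16).length;
    }

    // Number of bytes in big-endian UTF-16 without a byte-order mark.
    static int utf16BeLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String z = "\u2124"; // the character ℤ, written as an escape to avoid source-encoding issues

        System.out.println(z.length());                            // 1 (one UTF-16 code unit)
        System.out.println(Integer.toHexString(z.codePointAt(0))); // 2124 (the code point)
        System.out.println(utf8Length(z));                         // 3
        System.out.println(utf16Length(z));                        // 4 (2-byte BOM + 2 bytes)
        System.out.println(utf16BeLength(z));                      // 2
    }
}
```

So "a single code unit" in the quoted answer refers to one 16-bit unit in UTF-16, not to the number of bytes a particular encoder emits.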


Short calculation of UTF-8 encoding of given character:
To know which bytes belong to the same character, UTF-8 uses a system where the first byte of a multi-byte character starts with a certain number (let's call it N) of 1 bits followed by a 0 bit. N is the number of bytes the character takes up. The remaining N – 1 bytes each start with the bits 10. (Single-byte characters, i.e. ASCII, simply start with a 0 bit.)

Hex 0x2124 = binary 100001 00100100

According to the rules above, this converts to the following UTF-8 encoding:

11100010 10000100 10100100    <-- Our UTF-8 encoded result
^   ^ ^  ^ ^      ^ ^
AaaaBbDd CcDddddd CcDddddd    <-- Some notes, explained below
  • A is a run of ones followed by a zero, which denotes the number of bytes belonging to this character (three 1s = three bytes).
  • B is padding: the code point has only 14 significant bits, but the three-byte form has room for 16, so the two leading bits are zero.
  • C is the continuation marker (each subsequent byte starts with 10).
  • D is the actual bits of our code point.

So indeed, the character ℤ takes up three bytes.
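
The bit layout described above can be reproduced by hand. The following is a minimal sketch, not the JDK's actual implementation: the class Utf8Sketch and its encode method are hypothetical names, and the method assumes a valid code point (it does no checking for surrogates or out-of-range values).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class for illustration; not how the JDK encoder is written.
class Utf8Sketch {
    // Encodes one code point to UTF-8. Assumes cp is a valid, non-surrogate code point.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x2124);
        byte[] jdk = "\u2124".getBytes(StandardCharsets.UTF_8);

        // Print each byte as 8 binary digits (all three bytes have the top bit set).
        StringBuilder bits = new StringBuilder();
        for (byte b : manual) {
            bits.append(Integer.toBinaryString(b & 0xFF)).append(' ');
        }
        System.out.println(bits.toString().trim());    // 11100010 10000100 10100100
        System.out.println(Arrays.equals(manual, jdk)); // true
    }
}
```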




Answer 2:


Not all characters in the BMP are encoded using two bytes in UTF-8. Characters from U+0800 are encoded using 3 bytes, and from U+10000 using 4 bytes.

The full table can be found in the Wikipedia article on UTF-8:

https://en.wikipedia.org/wiki/UTF-8
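
The range boundaries from that table can be checked against the JDK encoder. This is a small sketch under the assumption that only valid, non-surrogate code points are passed in; Utf8Ranges, utf8Length and jdkLength are names invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration.
class Utf8Ranges {
    // Expected UTF-8 length from the range table (valid, non-surrogate code points only).
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;     // U+0000..U+007F
        if (cp < 0x800) return 2;    // U+0080..U+07FF
        if (cp < 0x10000) return 3;  // U+0800..U+FFFF (rest of the BMP)
        return 4;                    // U+10000..U+10FFFF
    }

    // Length actually produced by the JDK's UTF-8 encoder.
    static int jdkLength(int cp) {
        return new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x7F, 0x80, 0x7FF, 0x800, 0x2124, 0xFFFF, 0x10000};
        for (int cp : samples) {
            System.out.printf("U+%04X table=%d jdk=%d%n", cp, utf8Length(cp), jdkLength(cp));
        }
    }
}
```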



Source: https://stackoverflow.com/questions/40124088/if-%e2%84%a4-is-in-the-bmp-why-isnt-it-encoded-in-2-bytes
