If 'ℤ' is in the BMP, why isn't it encoded in 2 bytes?

Question

My question arises from this answer, which says:

Since 'ℤ' (0x2124) is in the basic multilingual plane it is represented by a single code unit.

If that's correct, then why is "ℤ".getBytes(StandardCharsets.UTF_8).length == 3 and "ℤ".getBytes(StandardCharsets.UTF_16).length == 4?

Answer 1:

It seems you're mixing up two things: the character set (Unicode) and its encodings (UTF-8 or UTF-16).

0x2124 is only the 'sequence number' in the Unicode table. Unicode is nothing more than a bunch of 'sequence numbers' mapped to certain characters. Such a sequence number is called a code point, and it's often written down as a hexadecimal number.

How that number is encoded is a separate matter, and the encoded form may take up more bytes than the raw code point would.
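To make the distinction concrete, here is a small sketch that prints the code point of 'ℤ' next to its encoded bytes. The class name `CodePointVsBytes` and the `hex` helper are made up for illustration; the charset constants are the ones from the question. It also suggests where the 4 in the question comes from: Java's `StandardCharsets.UTF_16` writes a two-byte byte-order mark (FE FF) before the data when encoding, while `UTF_16BE` does not.

```java
import java.nio.charset.StandardCharsets;

final class CodePointVsBytes {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'ℤ' written as an escape so the source file encoding does not matter.
        String z = "\u2124";

        // The code point is just a number in the Unicode table.
        System.out.println(Integer.toHexString(z.codePointAt(0)));        // 2124

        // One UTF-16 code unit (one Java char), since the character is in the BMP.
        System.out.println(z.length());                                   // 1

        // The encoded forms differ in length.
        System.out.println(hex(z.getBytes(StandardCharsets.UTF_8)));      // E2 84 A4
        System.out.println(hex(z.getBytes(StandardCharsets.UTF_16)));     // FE FF 21 24 (BOM + data)
        System.out.println(hex(z.getBytes(StandardCharsets.UTF_16BE)));   // 21 24
    }
}
```

So the "single code unit" in the quoted answer is a 16-bit UTF-16 code unit, which is two bytes once the byte-order mark is left out.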

Short calculation of the UTF-8 encoding of this character:
To mark which bytes belong to the same character, UTF-8 starts the first byte of a multi-byte sequence with a certain number (let's call it N) of 1 bits followed by a 0 bit. N is the number of bytes the character takes up. The remaining N - 1 bytes each start with the bits 10. (Single-byte ASCII characters are the exception: their only byte starts with a 0 bit.)

Hex 0x2124 = binary 100001 00100100

According to the rules above, this converts to the following UTF-8 encoding:

11100010 10000100 10100100    <-- Our UTF-8 encoded result
^   ^ ^  ^ ^      ^ ^
AaaaBbDd CcDddddd CcDddddd    <-- Some notes, explained below

A is a set of ones followed by a zero, which denote the number of bytes belonging to this character (three 1s = three bytes).
B is padding: a three-byte sequence has room for 16 code point bits, but our code point has only 14 significant bits, so two leading zeros fill the gap.
C is the continuation marker (each subsequent byte starts with 10).
D is the actual bits of our code point.

So indeed, the character ℤ takes up three bytes.
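The bit layout above can be reproduced with a few shifts and masks. The following is a minimal sketch, not a full encoder: the class and method names are hypothetical, and it only handles code points that need exactly three bytes (U+0800 to U+FFFF, excluding surrogates). It compares its result with what `getBytes` produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8ThreeByteSketch {
    // Hypothetical helper: encodes a code point from the three-byte range by hand.
    static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range");
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            throw new IllegalArgumentException("surrogates are not encodable");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),          // 1110xxxx: marker A plus the top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),  // 10xxxxxx: marker C plus the middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))          // 10xxxxxx: marker C plus the low 6 bits
        };
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeBytes(0x2124);
        byte[] library = "\u2124".getBytes(StandardCharsets.UTF_8);

        for (byte b : manual) {
            System.out.printf("%02X ", b);              // E2 84 A4
        }
        System.out.println();
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```

For 0x2124, the top four bits are 0010 (the two padding zeros plus the first two code point bits), which gives the first byte 11100010.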

Answer 2:

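Not all characters in the BMP are encoded using two bytes in UTF-8. Code points up to U+007F take 1 byte and those from U+0080 to U+07FF take 2 bytes. Characters from U+0800 up to U+FFFF (the rest of the BMP) are encoded using 3 bytes, and those from U+10000 upward (outside the BMP) using 4 bytes.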

The full table can be found in the Wikipedia article on UTF-8:

https://en.wikipedia.org/wiki/UTF-8
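The ranges from that table can be written as a small lookup and checked against the JDK. This is only a sketch under the standard UTF-8 range boundaries; `Utf8LengthSketch` and `utf8Length` are illustrative names, and the sample code points are arbitrary picks from each range (surrogates are deliberately avoided, since they cannot be encoded on their own).

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthSketch {
    // Hypothetical helper: number of UTF-8 bytes needed for a code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point");
        }
        if (codePoint <= 0x7F) {
            return 1; // ASCII
        }
        if (codePoint <= 0x7FF) {
            return 2;
        }
        if (codePoint <= 0xFFFF) {
            return 3; // rest of the BMP, including U+2124
        }
        return 4;     // supplementary planes
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x2124, 0x1F600};
        for (int cp : samples) {
            // Build a String from the code point and let the JDK encode it.
            String s = new String(Character.toChars(cp));
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: predicted %d, actual %d%n", cp, utf8Length(cp), actual);
        }
    }
}
```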

Source: https://stackoverflow.com/questions/40124088/if-%e2%84%a4-is-in-the-bmp-why-isnt-it-encoded-in-2-bytes

Tags

java

unicode