Question
My question arises from this answer, which says:
Since 'ℤ' (0x2124) is in the basic multilingual plane it is represented by a single code unit.
If that's correct, then why is "ℤ".getBytes(StandardCharsets.UTF_8).length == 3 and "ℤ".getBytes(StandardCharsets.UTF_16).length == 4?
Answer 1:
It seems you're mixing up two things: the character set (Unicode) and its encodings (UTF-8 or UTF-16).
0x2124 is only the 'sequence number' in the Unicode table. Unicode is nothing more than a bunch of 'sequence numbers' mapped to certain characters. Such a sequence number is called a code point, and it's often written down as a hexadecimal number.
How that number is encoded is a separate matter, and the encoded form may take up more bytes than the raw code point would.
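The difference between a code point and its encoded forms can be checked directly in Java. The following is a minimal sketch, not part of the original answer; the class name `EncodingLengths` and the helper `byteLength` are made up for illustration. It prints the code point, the number of UTF-16 code units in the `String`, and the byte counts for three charsets. `StandardCharsets.UTF_16BE` is included only for comparison, since it encodes without a byte order mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingLengths {
    // Hypothetical helper: number of bytes produced when encoding s with the given charset.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // The character from the question, written as an escape so the
        // source file encoding does not matter.
        String z = "\u2124";

        System.out.println(Integer.toHexString(z.codePointAt(0)));    // 2124: the code point
        System.out.println(z.length());                               // 1: one UTF-16 code unit
        System.out.println(byteLength(z, StandardCharsets.UTF_8));    // 3
        System.out.println(byteLength(z, StandardCharsets.UTF_16BE)); // 2: the bare code unit
        System.out.println(byteLength(z, StandardCharsets.UTF_16));   // 4: byte order mark FE FF, then the code unit
    }
}
```

The "single code unit" in the quoted answer is the 16-bit unit that `z.length()` counts, which is a different quantity from the byte counts returned by `getBytes`.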
Short calculation of the UTF-8 encoding of the given character:
To know which bytes belong to the same character, UTF-8 uses a system where the first byte of a multi-byte character starts with a certain number (let's call it N) of 1 bits followed by a 0 bit. N is the number of bytes the character takes up. The remaining N - 1 bytes each start with the bits 10.
Hex 0x2124 = binary 100001 00100100
According to the rules above, this converts to the following UTF-8 encoding:
11100010 10000100 10100100 <-- Our UTF-8 encoded result
^   ^ ^  ^ ^      ^ ^
AaaaBbDd CcDddddd CcDddddd <-- Some notes, explained below
A is a set of ones followed by a zero, which denotes the number of bytes belonging to this character (three 1s = three bytes).
B is padding, because otherwise the total number of bits is not divisible by 8.
C is the continuation bits (each subsequent byte starts with 10).
D is the actual bits of our code point.
So indeed, the character ℤ takes up three bytes in UTF-8.
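The bit layout above can be reproduced with a few shifts and masks. This is a rough sketch rather than a full encoder: it only handles the three-byte case (U+0800 through U+FFFF), it does not reject surrogates, and the names `ManualUtf8`, `encodeThreeBytes`, and `toBinary` are invented for this example. The result is compared with what `String.getBytes` produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ManualUtf8 {
    // Encodes a code point in U+0800..U+FFFF as three UTF-8 bytes.
    // Assumption: the caller does not pass a surrogate (U+D800..U+DFFF).
    static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110xxxx: prefix plus the top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10xxxxxx: the middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10xxxxxx: the low 6 bits
        };
    }

    // Formats one byte as eight binary digits, padded with leading zeros.
    static String toBinary(byte b) {
        return String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeBytes(0x2124);
        System.out.println(toBinary(manual[0]) + " " + toBinary(manual[1]) + " " + toBinary(manual[2]));
        // prints 11100010 10000100 10100100

        byte[] builtIn = "\u2124".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, builtIn)); // prints true
    }
}
```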
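Answer 2:
Not all characters in the BMP are encoded using two bytes in UTF-8. Only U+0080 through U+07FF take two bytes; characters from U+0800 through U+FFFF (which includes ℤ, U+2124) are encoded using 3 bytes, and code points from U+10000, which lie outside the BMP, using 4 bytes.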
The full table can be found in the Wikipedia article on UTF-8:
https://en.wikipedia.org/wiki/UTF-8
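The range boundaries in that table can be written as a small lookup. The sketch below uses the standard UTF-8 ranges (up to U+007F, U+07FF, U+FFFF, and U+10FFFF); the class and method names are hypothetical, and the sample code points in `main` are arbitrary picks, one per range. Each computed length is printed next to the length Java's own encoder reports.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Number of bytes UTF-8 uses for a single code point, by range.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point");
        }
        if (codePoint <= 0x7F) return 1;   // ASCII
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3; // the rest of the BMP, including U+2124
        return 4;                          // supplementary planes
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x2124, 0x1F600};
        for (int cp : samples) {
            // Build a String from the code point and let Java encode it.
            String s = new String(Character.toChars(cp));
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " computed=" + utf8Length(cp) + " actual=" + actual);
        }
    }
}
```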
Source: https://stackoverflow.com/questions/40124088/if-%e2%84%a4-is-in-the-bmp-why-isnt-it-encoded-in-2-bytes