Question
Example: the "large red circle" emoji 🔴 can be displayed in HTML using the hex NCR &#x1F534;
But if I create a text file with that same emoji in it, save the file with UTF-8 encoding, and then examine it with a hex editor, I can see the emoji is represented with these four bytes: F0 9F 94 B4. And that is a very different number.
What's the formula to convert between the two representations? How does one derive 0xF09F94B4 from 0x1F534, and vice versa?
Answer 1:
0x1F534 is the Unicode code point. In binary, padded to 24 bits, it is:
00000001 11110101 00110100
If you take a look at the UTF-8 Bit Distribution table from the Unicode Standard, you can see how these bits plug into the UTF-8 encoding of the code point.
Scalar Value                First Byte  Second Byte  Third Byte  Fourth Byte
00000000 0xxxxxxx           0xxxxxxx
00000yyy yyxxxxxx           110yyyyy    10xxxxxx
zzzzyyyy yyxxxxxx           1110zzzz    10yyyyyy     10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx  11110uuu    10uuzzzz     10yyyyyy    10xxxxxx
So you've got:
000uuuuu zzzzyyyy yyxxxxxx as
00000001 11110101 00110100
Plug the bits in:
11110uuu 10uuzzzz 10yyyyyy 10xxxxxx =
11110000 10011111 10010100 10110100
Which in hex is F0 9F 94 B4.
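The same bit-plugging can be sketched in Python. This is only an illustration: encode_utf8 is a made-up name, not a library function, and for real work the built-in str.encode("utf-8") already does the job.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (hypothetical helper for illustration)."""
    # Surrogates (D800..DFFF) and values above 10FFFF are not scalar values.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")

    if code_point < 0x80:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 110yyyyy 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


print(encode_utf8(0x1F534).hex(" ").upper())
```

Each `>> n` shifts the wanted group of bits down to the bottom, `& 0x3F` keeps six of them, and the `|` adds the fixed marker bits from the table (0xF0 is 11110000, 0x80 is 10000000).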
To go the other way, from UTF-8 to code point, you check the most significant bits of the first byte to see how many bytes are used (this should be clear from the table above), then pluck out the relevant bits and put them together.
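A minimal Python sketch of that reverse direction follows. The name decode_utf8_char is hypothetical, and the sketch assumes the input starts with one well-formed sequence: it checks the continuation-byte markers but does not reject overlong forms or surrogates the way a strict decoder such as bytes.decode("utf-8") does.

```python
def decode_utf8_char(data: bytes) -> int:
    """Decode the first UTF-8 sequence in data to a code point (hypothetical helper)."""
    if not data:
        raise ValueError("empty input")

    first = data[0]
    # The leading bits of the first byte say how many bytes the sequence uses.
    if first >> 7 == 0b0:          # 0xxxxxxx
        length, value = 1, first
    elif first >> 5 == 0b110:      # 110yyyyy
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:     # 1110zzzz
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:    # 11110uuu
        length, value = 4, first & 0x07
    else:
        raise ValueError(f"invalid first byte: {first:#04x}")

    if len(data) < length:
        raise ValueError("truncated sequence")

    for byte in data[1:length]:
        # Every continuation byte must look like 10xxxxxx.
        if byte >> 6 != 0b10:
            raise ValueError(f"invalid continuation byte: {byte:#04x}")
        # Make room for six more bits and append the payload.
        value = (value << 6) | (byte & 0x3F)
    return value


print(hex(decode_utf8_char(bytes.fromhex("F09F94B4"))))
```

For the four bytes F0 9F 94 B4 this gives back 0x1f534, the number used in the HTML reference.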
Bonus lineart:
000uuuuu zzzzyyyy yyxxxxxx as
00000001 11110101 00110100
   │││││ ││││││││ │││││││└────────┐
   │││││ ││││││││ ││││││└────────┐│
   │││││ ││││││││ │││││└────────┐││
   │││││ ││││││││ ││││└────────┐│││
   │││││ ││││││││ │││└────────┐││││
   │││││ ││││││││ ││└────────┐│││││
   │││││ ││││││││ │└─────┐   ││││││
   │││││ ││││││││ └─────┐│   ││││││
   │││││ │││││││└──────┐││   ││││││
   │││││ ││││││└──────┐│││   ││││││
   │││││ │││││└──────┐││││   ││││││
   │││││ ││││└──────┐│││││   ││││││
   │││││ │││└───┐   ││││││   ││││││
   │││││ ││└───┐│   ││││││   ││││││
   │││││ │└───┐││   ││││││   ││││││
   │││││ └───┐│││   ││││││   ││││││
   ││││└────┐││││   ││││││   ││││││
   │││└────┐│││││   ││││││   ││││││
   ││└─┐   ││││││   ││││││   ││││││
   │└─┐│   ││││││   ││││││   ││││││
   └─┐││   ││││││   ││││││   ││││││
11110uuu 10uuzzzz 10yyyyyy 10xxxxxx =
11110000 10011111 10010100 10110100
Source: https://stackoverflow.com/questions/45086505/how-can-i-convert-between-hex-ncrs-and-utf-8-code-units