How can I convert between hex NCRs and UTF-8 code units?

Submitted by 为君一笑 on 2019-12-11 05:18:37

Question


Example: the "large red circle" emoji 🔴 can be displayed in HTML using the hex NCR &#x1F534;. But if I create a text file containing that same emoji, save it with UTF-8 encoding, and examine it in a hex editor, I see the emoji represented by these four bytes: F0 9F 94 B4. That looks like a very different number.

What's the formula to convert between the two representations? How does one derive 0xF09F94B4 from 0x1F534, and vice versa?


Answer 1:


1F534 is the Unicode code point, written in hexadecimal. In binary it is:

00000001 11110101 00110100
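As a quick sanity check, the bit pattern can be reproduced in Python. This is only an illustrative sketch using the built-in `format`; the variable names are my own.

```python
code_point = 0x1F534

# Pad to 24 bits so the value splits into three groups of eight.
bits = format(code_point, "024b")

# Insert a space between each group of eight bits for readability.
grouped = " ".join(bits[i:i + 8] for i in range(0, 24, 8))
print(grouped)  # 00000001 11110101 00110100
```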

The UTF-8 Bit Distribution table from the Unicode Standard shows how these bits slot into the UTF-8 encoding of the code point:

Scalar Value                First Byte  Second Byte Third Byte  Fourth Byte
00000000 0xxxxxxx           0xxxxxxx            
00000yyy yyxxxxxx           110yyyyy    10xxxxxx        
zzzzyyyy yyxxxxxx           1110zzzz    10yyyyyy    10xxxxxx    
000uuuuu zzzzyyyy yyxxxxxx  11110uuu    10uuzzzz    10yyyyyy    10xxxxxx

So you've got:

000uuuuu zzzzyyyy yyxxxxxx as
00000001 11110101 00110100

Plug the bits in:

11110uuu 10uuzzzz 10yyyyyy 10xxxxxx =
11110000 10011111 10010100 10110100

Which in hex is F0 9F 94 B4.
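The bit-plugging above can be sketched in Python with shifts and masks, one branch per row of the table. The function name `encode_utf8` is hypothetical, chosen for illustration; in real code you would normally just call `chr(code_point).encode("utf-8")`, which the last line uses as a cross-check.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8, following the bit table."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = encode_utf8(0x1F534)
print(encoded.hex(" ").upper())  # F0 9F 94 B4

# Cross-check against Python's built-in encoder.
assert encoded == chr(0x1F534).encode("utf-8")
```

Each `& 0x3F` keeps the low six bits of the shifted value, and each `0x80 |` adds the `10` prefix that marks a continuation byte.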


To go the other way, from UTF-8 to code point, you check the most significant bits of the first byte to see how many bytes are used (this should be clear from the table above), then pluck out the relevant bits and put them together.
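A minimal sketch of that reverse direction in Python follows. The name `decode_utf8` is hypothetical, and the function is deliberately simplified: it checks the lead byte and the `10` prefix of each continuation byte, but it does not reject overlong encodings or surrogate code points the way a strict decoder would.

```python
def decode_utf8(data: bytes) -> int:
    """Decode the first UTF-8 sequence in `data` to its code point (simplified)."""
    first = data[0]
    # The leading bits of the first byte give the sequence length.
    if first < 0x80:                # 0xxxxxxx
        length, value = 1, first
    elif first >> 5 == 0b110:       # 110yyyyy
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:      # 1110zzzz
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:     # 11110uuu
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    # Each continuation byte contributes its low six bits.
    for byte in data[1:length]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


code_point = decode_utf8(bytes.fromhex("F0 9F 94 B4"))
print(f"&#x{code_point:X};")  # &#x1F534;
```

Formatting the result with `:X` gives the hexadecimal digits that go inside the NCR.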


Bonus line art:

000uuuuu zzzzyyyy yyxxxxxx as
00000001 11110101 00110100
   │││││ ││││││││ │││││││└────────┐
   │││││ ││││││││ ││││││└────────┐│
   │││││ ││││││││ │││││└────────┐││
   │││││ ││││││││ ││││└────────┐│││
   │││││ ││││││││ │││└────────┐││││
   │││││ ││││││││ ││└────────┐│││││
   │││││ ││││││││ │└─────┐   ││││││
   │││││ ││││││││ └─────┐│   ││││││
   │││││ │││││││└──────┐││   ││││││
   │││││ ││││││└──────┐│││   ││││││
   │││││ │││││└──────┐││││   ││││││
   │││││ ││││└──────┐│││││   ││││││
   │││││ │││└───┐   ││││││   ││││││
   │││││ ││└───┐│   ││││││   ││││││
   │││││ │└───┐││   ││││││   ││││││
   │││││ └───┐│││   ││││││   ││││││
   ││││└────┐││││   ││││││   ││││││
   │││└────┐│││││   ││││││   ││││││
   ││└─┐   ││││││   ││││││   ││││││
   │└─┐│   ││││││   ││││││   ││││││
   └─┐││   ││││││   ││││││   ││││││
11110uuu 10uuzzzz 10yyyyyy 10xxxxxx =
11110000 10011111 10010100 10110100


Source: https://stackoverflow.com/questions/45086505/how-can-i-convert-between-hex-ncrs-and-utf-8-code-units
