What is the encoding of Chinese characters on Wikipedia?

Submitted by 安稳与你 on 2019-12-03 09:44:16

>>> c = b'\xe7\x9a\x84'.decode('utf8')  # Python 3: decode a bytes literal
>>> c
'的'
>>> hex(ord(c))  # the Unicode code point
'0x7684'
>>> print(c)
的


Although the Unicode code point of this character (U+7684) fits in 16 bits, UTF-8 encodes it as 3 bytes.
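
To make the byte counts concrete, here is a minimal Python 3 sketch that encodes the same character both ways. The variable names are only illustrative.

```python
ch = "\u7684"  # the character 的

code_point = ord(ch)
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")  # big-endian, without a byte-order mark

print(hex(code_point))         # 0x7684
print(utf8.hex(), len(utf8))   # e79a84 3
print(utf16.hex(), len(utf16)) # 7684 2
```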

The HTML head of a Wikipedia page includes this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

So the page is UTF-8.
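
If you want to read that declaration programmatically, something like the following sketch could work. It uses only the standard library; the `CharsetFinder` class is a hypothetical helper written for this answer, and it only looks at `<meta>` tags (a real client should also check the HTTP `Content-Type` response header).

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Hypothetical helper: records the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if "charset" in attrs:
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"]
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: content="text/html; charset=UTF-8"
            for part in (attrs.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.lower() == "charset":
                    self.charset = value.strip()


html = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
finder = CharsetFinder()
finder.feed(html)
print(finder.charset)  # UTF-8
```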

The example you give is an IRI.

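An IRI contains Unicode characters directly; when it is converted to a URI, each non-ASCII character is encoded as UTF-8 and the resulting bytes are percent-encoded. UTF-8 is an encoding of Unicode, and in Unicode each character has a code point. The most common Chinese characters sit in the CJK Unified Ideographs block, between U+4E00 and U+9FFF, so their code points fit in 2 bytes. Rarer characters live in extension blocks, some of them above U+FFFF.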
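
As a rough illustration of the IRI-to-URI step, `urllib.parse.quote` in Python 3 encodes to UTF-8 by default and then percent-encodes the bytes. The path below is a made-up example, not the URL from the question.

```python
from urllib.parse import quote, unquote

iri_path = "/wiki/的"        # hypothetical IRI path, for illustration only
uri_path = quote(iri_path)   # UTF-8 by default; "/" is left unescaped

print(uri_path)                       # /wiki/%E7%9A%84
print(unquote(uri_path) == iri_path)  # True
```

The three escaped bytes are the same `e7 9a 84` that the UTF-8 encoder produces for this character.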

But UTF-8 doesn't encode characters by just storing their code point (UTF-32 does that). Instead, it uses a variable-length scheme in which ideographs between U+4E00 and U+9FFF take 3 bytes each, and those above U+FFFF take 4 bytes.
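
The 3-byte layout can be sketched by hand. The function below is a hypothetical illustration (Python's built-in encoder is what you would actually use); it covers only the 3-byte form and does not reject surrogate code points.

```python
def utf8_encode_3byte(code_point):
    """Pack a code point in U+0800..U+FFFF into the 3-byte UTF-8 layout."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("the 3-byte form covers U+0800..U+FFFF only")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])


manual = utf8_encode_3byte(0x7684)
print(manual.hex())                         # e79a84
print(manual == "\u7684".encode("utf-8"))   # True

# UTF-32 stores the code point itself, padded to 4 bytes.
print("\u7684".encode("utf-32-be").hex())   # 00007684

# An ideograph above U+FFFF (U+20000, CJK Extension B) needs 4 bytes in UTF-8.
print(len("\U00020000".encode("utf-8")))    # 4
```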
