Is UTF-16 compatible with UTF-8?

蹲街弑〆低调 submitted on 2019-12-01 14:43:42

It's not clear what you mean by "compatible", so let's get some basics out of the way.

Unicode is the underlying concept; UTF-16 and UTF-8, properly implemented, are two different ways to encode it. They are obviously different -- otherwise, why would there be two different concepts?

Unicode by itself does not specify a serialization format. UTF-8 and UTF-16 are two alternative serialization formats.

They are "compatible" in the sense that they can represent the same Unicode code points, but "incompatible" in that the representations are completely different.

There are two additional twists with UTF-16. First, there are actually two different byte serializations, UTF-16LE and UTF-16BE, which differ in endianness. (UTF-8 is a byte-oriented encoding, so it has no endianness.) Second, 16-bit Unicode was originally restricted to 65,536 possible characters, which is fewer than Unicode currently contains. UTF-16 handles this with surrogate pairs, but really old and/or broken implementations (properly identified as UCS-2, not "real" UTF-16) do not support them.
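As a rough illustration of both twists, here is a minimal Python sketch. The helper name `to_surrogates` is made up for this example; the arithmetic is the standard UTF-16 surrogate calculation, and U+1F4A9 is used as a sample code point above U+FFFF.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair.

    Hypothetical helper, written only for this illustration.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogates(0x1F4A9)
print(f"U+1F4A9 -> {high:04X} {low:04X}")

# Endianness only changes how each 16-bit unit is written out as bytes.
print("\U0001F4A9".encode("utf-16-le").hex(" "))
print("\U0001F4A9".encode("utf-16-be").hex(" "))
```

The two surrogates are the same in both variants; only the byte order inside each 16-bit unit differs. (`bytes.hex` with a separator needs Python 3.8 or later.)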

To make this concrete, let's compare four different code points. We pick U+0041, U+00E5, U+201C, and U+1F4A9, as they illustrate the differences nicely.

U+0041 is a 7-bit (ASCII) character, so UTF-8 represents it simply with a single byte. U+00E5 fits in 8 bits but is outside ASCII, so UTF-8 needs two bytes to encode it. U+1F4A9 is outside the Basic Multilingual Plane, so UTF-16 represents it with a surrogate pair. Finally, U+201C is none of the above: it is inside the Basic Multilingual Plane, so UTF-16 uses a single 16-bit unit, but UTF-8 needs three bytes.

Here are the representations of our candidate characters in UTF-8, UTF-16LE, and UTF-16BE.

Character | UTF-8               | UTF-16LE            | UTF-16BE            |
----------+---------------------+---------------------+---------------------+
U+0041    | 0x41                | 0x41 0x00           | 0x00 0x41           |
U+00E5    | 0xC3 0xA5           | 0xE5 0x00           | 0x00 0xE5           |
U+201C    | 0xE2 0x80 0x9C      | 0x1C 0x20           | 0x20 0x1C           |
U+1F4A9   | 0xF0 0x9F 0x92 0xA9 | 0x3D 0xD8 0xA9 0xDC | 0xD8 0x3D 0xDC 0xA9 |
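If you want to check the table yourself, something like the following Python sketch should reproduce it. The names `ENCODINGS` and `encoded_bytes` are invented for this example; the codec names are Python's standard ones.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be")


def encoded_bytes(code_point):
    """Return {encoding name: hex byte string} for a single code point."""
    ch = chr(code_point)
    return {enc: ch.encode(enc).hex(" ").upper() for enc in ENCODINGS}


for cp in (0x0041, 0x00E5, 0x201C, 0x1F4A9):
    row = encoded_bytes(cp)
    print(f"U+{cp:04X}", *(row[enc] for enc in ENCODINGS), sep=" | ")
```

Note that the explicit `utf-16-le` and `utf-16-be` codecs are used on purpose: Python's plain `utf-16` codec prepends a byte order mark when encoding, which would add two extra bytes to every row.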

To pick one obvious example, the UTF-8 encoding of U+00E5 would represent a completely different character if interpreted as UTF-16 (in UTF-16LE, it would be U+A5C3, and in UTF-16BE, U+C3A5). Conversely, many UTF-16 byte sequences are not valid UTF-8 at all. So in this sense, UTF-8 and UTF-16 are completely and utterly incompatible.
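The mismatch is easy to demonstrate. The following is a small Python sketch, with variable names chosen just for this example, that decodes the UTF-8 bytes of U+00E5 as UTF-16, and then tries the reverse.

```python
utf8_bytes = "\u00e5".encode("utf-8")        # the two bytes C3 A5

# Misreading UTF-8 bytes as UTF-16 "works", but yields the wrong character.
as_le = utf8_bytes.decode("utf-16-le")
as_be = utf8_bytes.decode("utf-16-be")
print(f"as UTF-16LE: U+{ord(as_le):04X}")
print(f"as UTF-16BE: U+{ord(as_be):04X}")

# Misreading UTF-16 bytes as UTF-8 often fails outright.
utf16_bytes = "\u00e5".encode("utf-16-le")   # the two bytes E5 00
try:
    utf16_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError as exc:
    valid_utf8 = False
    print("not valid UTF-8:", exc.reason)
```

The first case is arguably the nastier one, because nothing raises an error; you just silently get a different character.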

The values in the table are byte values; in ASCII, 0x00 is the NUL character (sometimes represented as ^@), 0x41 is uppercase A, and 0xE5 is undefined; in e.g. Latin-1 it represents the character å (which is also conveniently U+00E5 in Unicode), but in KOI8-R it is the Cyrillic character Е (U+0415), etc.
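The same point, that a byte only means something once you know the encoding, can be sketched in Python using its built-in `latin-1`, `koi8_r`, and `ascii` codecs:

```python
raw = b"\xe5"

latin1_char = raw.decode("latin-1")   # å, U+00E5
koi8r_char = raw.decode("koi8_r")     # Cyrillic Е, U+0415
print(f"Latin-1: U+{ord(latin1_char):04X}")
print(f"KOI8-R:  U+{ord(koi8r_char):04X}")

# ASCII only defines 0x00 through 0x7F, so 0xE5 is rejected.
try:
    raw.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False
print("valid ASCII:", is_ascii)
```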

In modern programming languages, your code should simply use Unicode, and let the language handle the nitty-gritty of encoding it in a way which is suitable for your platform and libraries. On a somewhat tangential note, see also http://utf8everywhere.org/
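In practice, "simply use Unicode" usually means decoding bytes once at the input boundary, working with text in between, and encoding once on the way out. A minimal Python sketch of that pattern follows; the function name `transcode_upper` and the sample input are hypothetical, invented only for this illustration.

```python
def transcode_upper(raw, source_encoding, target_encoding):
    """Decode at the boundary, work on text, encode again on the way out.

    Hypothetical helper; the uppercase step stands in for any text processing.
    """
    text = raw.decode(source_encoding)     # bytes -> str (Unicode code points)
    processed = text.upper()               # logic never touches raw bytes
    return processed.encode(target_encoding)


incoming = b"\xe5\x00"                     # "å" serialized as UTF-16LE
outgoing = transcode_upper(incoming, "utf-16-le", "utf-8")
print(outgoing.hex(" "))
```

The processing step never needs to know which serialization was used on either side.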
