What is the difference between UTF-8 and Unicode?

独厮守ぢ 2020-11-22 17:08

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page.

They are the same thing, aren't they? Can someone clarify?

15 answers
  •  说谎 (original poster)
    2020-11-22 17:21

    "Unicode" is unfortunately used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them.
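
    To make the "coded character set" idea concrete, here is a minimal Python sketch of the character-to-code-point mapping. The helper name code_point_label and the sample characters are my own choices for illustration; only ord() and chr() are built-ins.

```python
# Unicode as a coded character set: every character maps to an integer code point.
# code_point_label is a hypothetical helper, not part of any library.

def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"

# Sample characters written as escapes so the file encoding does not matter:
# Latin capital A, e with acute accent, euro sign, grinning face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(ch, code_point_label(ch), ord(ch))

# The mapping is reversible: chr() turns a code point back into its character.
assert chr(0x20AC) == "\u20ac"
```

    Note that nothing here involves bytes yet; a code point is just a number, and how that number is stored is the job of an encoding.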

    UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set. ASCII characters are encoded as a single byte each, and other characters take more bytes depending on their exact code point (up to 4 bytes for every valid code point, i.e. up to U+10FFFF, and indeed the 4-byte form could cope with values up to U+1FFFFF).
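
    The variable-length behaviour is easy to check. The following is a small Python sketch, assuming the same four sample characters as an illustration; utf8_length is a hypothetical helper, while str.encode and bytes.decode are standard.

```python
# UTF-8 is one way of turning code points into bytes (and back).
# utf8_length is a hypothetical helper for illustration.

def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for the given character."""
    return len(ch.encode("utf-8"))

# A (ASCII), e-acute, euro sign, an emoji, and the highest code point U+10FFFF.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600", chr(0x10FFFF)]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")
    # Decoding the bytes gives the original character back.
    assert encoded.decode("utf-8") == ch
```

    ASCII stays at one byte, and the byte count grows with the code point, topping out at four bytes for U+10FFFF.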

    When "Unicode" is used as the name of a character encoding (e.g. as the .NET Encoding.Unicode property) it usually means UTF-16, which encodes most common characters as two bytes. Some platforms (notably .NET and Java) use UTF-16 as their "native" character encoding. This leads to hairy problems if you need to worry about characters that can't be encoded in a single 16-bit UTF-16 code unit (they're encoded as "surrogate pairs", i.e. two code units) - but most developers never worry about this, IME.
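
    Surrogate pairs can be seen by splitting UTF-16 output into 16-bit code units. This is a rough Python sketch rather than .NET or Java code; utf16_code_units is a hypothetical helper, and big-endian without a byte order mark is an arbitrary choice to keep the output readable.

```python
import struct

# utf16_code_units is a hypothetical helper for illustration.
def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of text in UTF-16 (big-endian, no BOM)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))

euro = "\u20ac"        # inside the Basic Multilingual Plane: one code unit
emoji = "\U0001F600"   # outside it: needs a surrogate pair

print([hex(u) for u in utf16_code_units(euro)])
print([hex(u) for u in utf16_code_units(emoji)])

# Python strings count code points, so the emoji has length 1 here even
# though its UTF-16 form is two code units long.
print(len(emoji), len(utf16_code_units(emoji)))
```

    On platforms whose strings are sequences of UTF-16 code units, a length or index operation counts those units, which is where the surrogate-pair surprises come from.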

    Some references on Unicode:

    • The Unicode consortium web site and in particular the tutorials section
    • Joel's article
    • My own article (.NET-oriented)
