Is there any reason to prefer UTF-16 over UTF-8?

野性不改 2020-12-25 11:39

Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16.

However, checking out Java and C#, it looks like strings and chars there default to UTF-16.

7 Answers
  • 2020-12-25 12:09

    East Asian languages typically require less storage in UTF-16 (2 bytes suffice for 99% of East Asian characters) than in UTF-8 (where 3 bytes are typically required).

    Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of markup) it's much of a muchness.

    Processing UTF-16 in user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in almost the same way that combining characters do. So UTF-16 can usually be processed as if it were a fixed-size encoding.
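    As a rough sanity check of those size claims, here is a minimal sketch using only the standard java.nio.charset API (the sample strings are arbitrary; UTF_16BE is used so the 2-byte BOM that UTF_16 prepends doesn't skew the counts):

    ```java
    import java.nio.charset.StandardCharsets;

    public class EncodedSize {
        public static void main(String[] args) {
            String cjk = "你好世界"; // four CJK ideographs
            String ascii = "abcd";   // four ASCII letters

            // CJK: 3 bytes per character in UTF-8, 2 in UTF-16
            System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);    // 12
            System.out.println(cjk.getBytes(StandardCharsets.UTF_16BE).length); // 8

            // ASCII: 1 byte per character in UTF-8, 2 in UTF-16
            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 4
            System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length); // 8
        }
    }
    ```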

  • 2020-12-25 12:10

    If we're talking about plain text alone, UTF-16 can be more compact in some languages, Japanese (about 20%) and Chinese (about 40%) being prime examples. As soon as you're comparing HTML documents, the advantage goes completely the other way, since UTF-16 is going to waste a byte for every ASCII character.

    As for simplicity or efficiency: if you implement Unicode correctly in an editor application, complexity will be similar because UTF-16 does not always encode codepoints as a single number anyway, and single codepoints are generally not the right way to segment text.

    Given that in the most common applications, UTF-16 is less compact, and equally complex to implement, the singular reason to prefer UTF-16 over UTF-8 is if you have a completely closed ecosystem where you are regularly storing or transporting plain text entirely in complex writing systems, without compression.

    After compression with zstd or LZMA2, even for 100% Chinese plain text, the advantage is completely wiped out; with gzip the UTF-16 advantage is about 4% on Chinese text with around 3000 unique graphemes.
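    If you want to try the compression claim yourself, here is a rough sketch using gzip from java.util.zip (the repeated sample string is only a stand-in for a real Chinese corpus, so the exact percentages will differ; String.repeat needs Java 11+):

    ```java
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class CompressedSize {
        // gzip a byte array and return the compressed length
        static int gzippedLength(byte[] data) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(data);
            }
            return buf.size();
        }

        public static void main(String[] args) throws IOException {
            String text = "汉字文本压缩测试".repeat(1000); // stand-in corpus
            System.out.println(gzippedLength(text.getBytes(StandardCharsets.UTF_8)));
            System.out.println(gzippedLength(text.getBytes(StandardCharsets.UTF_16BE)));
        }
    }
    ```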

  • 2020-12-25 12:16

    For many (most?) applications, you will be dealing only with characters in the Basic Multilingual Plane, so you can treat UTF-16 as a fixed-length encoding.

    So you avoid all the complexity of variable-length encodings like UTF-8.
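    In Java terms (where a String is a sequence of UTF-16 code units), that assumption looks like this: as long as every code point is in the BMP, the char count and the code point count agree, and they diverge as soon as a supplementary character appears. A small sketch:

    ```java
    public class BmpAssumption {
        public static void main(String[] args) {
            String bmp = "héllo 世界"; // every code point is in the BMP
            System.out.println(bmp.length());                        // 8
            System.out.println(bmp.codePointCount(0, bmp.length())); // 8

            // U+1D11E (musical G clef) is outside the BMP: one code point, two chars
            String astral = "clef: \uD834\uDD1E";
            System.out.println(astral.length());                           // 8
            System.out.println(astral.codePointCount(0, astral.length())); // 7
        }
    }
    ```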

  • 2020-12-25 12:23

    @Oak: this is too long for a comment...

    I don't know about C# (and I would be really surprised if it were different: it would mean they just copied Java too closely), but for Java it's simple: Java was conceived before Unicode 3.1 came out.

    Hence there were fewer than 65,537 code points, so every Unicode code point still fit in 16 bits, and thus the Java char was born.

    Of course this led to crazy issues that still affect Java programmers (like me) today: a method charAt which in some cases returns neither a Unicode character nor a Unicode code point, and a method codePointAt (added in Java 5) whose argument is not the number of code points you want to skip! You have to supply codePointAt with the number of Java chars you want to skip, which makes it one of the least understood methods in the String class.
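    A small demo of both quirks, using U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside the BMP and is stored as a surrogate pair:

    ```java
    public class CodePointDemo {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1Eb"; // 3 code points, but length() == 4

            System.out.println(s.length());        // 4: UTF-16 code units, not code points
            System.out.println((int) s.charAt(1)); // 55348 (0xD834): an unpaired lead
                                                   // surrogate, neither a character
                                                   // nor a code point
            // codePointAt takes a char index, not a code-point index:
            System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e
            System.out.println(Integer.toHexString(s.codePointAt(2))); // dd1e (mid-pair!)
        }
    }
    ```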

    So, yup, this is definitely wild, it confuses most Java programmers (most aren't even aware of these issues), and, yup, it's for historical reasons. At least, that was the excuse people came up with when others got mad about the issue: Unicode 3.1 simply wasn't out yet.

    :)

  • 2020-12-25 12:30

    It depends on the expected character sets. If you expect heavy use of Unicode code points outside of the 7-bit ASCII range then you might find that UTF-16 will be more compact than UTF-8, since some UTF-8 sequences are more than two bytes long.

    Also, for efficiency reasons, Java and C# do not take surrogate pairs into account when indexing strings. That kind of indexing would break down completely with code points represented by UTF-8 sequences, which take up varying, often odd, numbers of bytes.
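    To see how surrogate-blind indexing goes wrong in practice, here is a sketch that slices an emoji in half; the resulting unpaired surrogate cannot be encoded, so the JDK's getBytes substitutes its replacement byte:

    ```java
    import java.nio.charset.StandardCharsets;

    public class BrokenIndexing {
        public static void main(String[] args) {
            String s = "\uD83D\uDE00"; // U+1F600, one code point stored as two chars

            // Char-based indexing happily cuts through the surrogate pair
            String half = s.substring(0, 1); // an unpaired lead surrogate

            // An unpaired surrogate is not encodable, so getBytes() replaces it with '?'
            byte[] utf8 = half.getBytes(StandardCharsets.UTF_8);
            System.out.println(utf8.length);                              // 1
            System.out.println(new String(utf8, StandardCharsets.UTF_8)); // ?
        }
    }
    ```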

  • 2020-12-25 12:31

    UTF-16 can be more efficient for representing characters in some languages such as Chinese, Japanese and Korean, where most characters can be represented in one 16-bit word. Some rarely used characters may require two 16-bit words. UTF-8 is generally much more efficient for representing characters from Western European character sets - UTF-8 and ASCII are equivalent over the ASCII range (0-127) - but less efficient with Asian languages, requiring three or four bytes to represent characters that can be represented with two bytes in UTF-16.

    UTF-16 has an advantage as an in-memory format for Java/C# in that every character in the Basic Multilingual Plane can be represented in 16 bits (see Joe's answer) and some of the disadvantages of UTF-16 (e.g. confusing code relying on \0 terminators) are less relevant.
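    The \0 point is easy to see by dumping raw bytes: UTF-16 output for ASCII text is full of zero bytes, which is exactly what trips up C-style code that treats 0x00 as a terminator. A minimal illustration:

    ```java
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EmbeddedNuls {
        public static void main(String[] args) {
            byte[] utf8  = "Hi".getBytes(StandardCharsets.UTF_8);
            byte[] utf16 = "Hi".getBytes(StandardCharsets.UTF_16BE);

            System.out.println(Arrays.toString(utf8));  // [72, 105] - no zero bytes
            System.out.println(Arrays.toString(utf16)); // [0, 72, 0, 105] - zero bytes
                                                        // that C's strlen() would stop at
        }
    }
    ```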
