Is there any reason to prefer UTF-16 over UTF-8?

后端 未结 7 1635
野性不改
野性不改 2020-12-25 11:39

Examining the attributes of UTF-16 and UTF-8, I can\'t find any reason to prefer UTF-16.

However, checking out Java and C#, it looks like strings and chars there def

7条回答
  •  刺人心
    刺人心 (楼主)
    2020-12-25 12:10

    If we're talking about plain text alone, UTF-16 can be more compact in some languages, Japanese (about 20%) and Chinese (about 40%) being prime examples. As soon as you're comparing HTML documents, the advantage goes completely the other way, since UTF-16 is going to waste a byte for every ASCII character.

    As for simplicity or efficiency: if you implement Unicode correctly in an editor application, complexity will be similar because UTF-16 does not always encode codepoints as a single number anyway, and single codepoints are generally not the right way to segment text.

    Given that in the most common applications, UTF-16 is less compact, and equally complex to implement, the singular reason to prefer UTF-16 over UTF-8 is if you have a completely closed ecosystem where you are regularly storing or transporting plain text entirely in complex writing systems, without compression.

    After compression with zstd or LZMA2, even for 100% Chinese plain text, the advantage is completely wiped out; with gzip the UTF-16 advantage is about 4% on Chinese text with around 3000 unique graphemes.

提交回复
热议问题