Why does .NET use the UTF-16 encoding for string, but UTF-8 as the default for saving files?

Submitted by 断了今生、忘了曾经 on 2019-12-17 15:32:06

Question


From here

Essentially, string uses the UTF-16 character encoding form

But when saving via StreamWriter:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM).
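
A minimal C# sketch of both defaults (the file names below are just placeholders): the path-only StreamWriter constructor writes UTF-8 with no BOM, while passing Encoding.Unicode (UTF-16LE) explicitly writes a BOM and two bytes per BMP character.

```csharp
using System;
using System.IO;
using System.Text;

class StreamWriterDefaults
{
    static void Main()
    {
        // The path-only constructor uses UTF-8 with no byte-order mark.
        using (var writer = new StreamWriter("demo-utf8.txt"))
        {
            Console.WriteLine(writer.Encoding.WebName);   // utf-8
            writer.Write("héllo");
        }
        // 6 bytes: 1 each for h, l, l, o and 2 for é -- and no 0xEF 0xBB 0xBF prefix.
        Console.WriteLine(File.ReadAllBytes("demo-utf8.txt").Length);

        // To save as UTF-16 (what .NET calls Encoding.Unicode), pass it explicitly.
        using (var writer = new StreamWriter("demo-utf16.txt", false, Encoding.Unicode))
        {
            writer.Write("héllo");
        }
        // 12 bytes: a 2-byte BOM (0xFF 0xFE) plus 2 bytes per character.
        Console.WriteLine(File.ReadAllBytes("demo-utf16.txt").Length);
    }
}
```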

I've seen this sample (broken link removed):

It looks like UTF-8 is smaller for some strings, while UTF-16 is smaller for others.
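
A quick way to see this for yourself (a small sketch, not the original sample) is to compare byte counts for a few strings:

```csharp
using System;
using System.Text;

class EncodingSizes
{
    static void Compare(string s)
    {
        Console.WriteLine("{0}: UTF-8 = {1} bytes, UTF-16 = {2} bytes",
            s, Encoding.UTF8.GetByteCount(s), Encoding.Unicode.GetByteCount(s));
    }

    static void Main()
    {
        Compare("Hello");    // ASCII: UTF-8 5 bytes, UTF-16 10 bytes
        Compare("héllo");    // Accented Latin: UTF-8 6 bytes, UTF-16 10 bytes
        Compare("你好世界"); // CJK: UTF-8 12 bytes (3 per char), UTF-16 8 bytes (2 per char)
        Compare("😀");       // Outside the BMP: 4 bytes in both encodings
    }
}
```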

  • So why does .NET use UTF-16 as the default encoding for string, but UTF-8 when saving files?

Thank you.

P.S. I've already read the famous article.


Answer 1:


If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.

Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!

Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.
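
To make that trade-off concrete, here is a small sketch (not part of the original answer): char is a fixed two-byte UTF-16 code unit, and any BMP character, even a non-Latin one, fits in a single code unit, which is what makes indexing a string a simple array lookup.

```csharp
using System;

class CharIsAUtf16CodeUnit
{
    static void Main()
    {
        // A .NET char is a single UTF-16 code unit: always 2 bytes.
        Console.WriteLine(sizeof(char));   // 2

        // Every BMP character -- Latin, Greek, CJK, symbols -- fits in one char.
        char euro = '€';                   // U+20AC, still a single 2-byte code unit
        Console.WriteLine((int)euro);      // 8364

        // So indexing a string is a constant-time lookup into its code units.
        string s = "a€你";
        Console.WriteLine(s.Length);       // 3
        Console.WriteLine(s[2]);           // 你
    }
}
```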

The two disadvantages of UTF-16 are:

  • The number of code units per Unicode character is variable, because not all characters are in the BMP. Until emoji became popular, this didn't affect many apps in day-to-day use. These days, certainly for messaging apps and the like, developers using UTF-16 really need to know about surrogate pairs (see the sketch after this list).
  • For plain ASCII (which a lot of text is, at least in the west) it takes twice the space of the equivalent UTF-8 encoded text.
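
Here is a short sketch of the first disadvantage, showing how a character outside the BMP becomes a surrogate pair in a .NET string (the emoji is just an example character):

```csharp
using System;
using System.Globalization;

class SurrogatePairs
{
    static void Main()
    {
        string s = "a😀b";   // 😀 is U+1F600, outside the BMP

        // Length counts UTF-16 code units, not characters: the emoji uses two.
        Console.WriteLine(s.Length);                         // 4

        // The code units at index 1 and 2 form a surrogate pair.
        Console.WriteLine(char.IsHighSurrogate(s[1]));       // True
        Console.WriteLine(char.IsLowSurrogate(s[2]));        // True
        Console.WriteLine(char.ConvertToUtf32(s[1], s[2]));  // 128512 (0x1F600)

        // StringInfo counts text elements rather than code units.
        Console.WriteLine(new StringInfo(s).LengthInTextElements);  // 3
    }
}
```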

(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)

Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
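
For example, the same rendered character can be stored either as one precomposed code point or as a base letter plus a combining mark; a small sketch of how normalization reconciles the two:

```csharp
using System;

class NormalizationForms
{
    static void Main()
    {
        string composed   = "\u00E9";    // 'é' as a single precomposed code point
        string decomposed = "e\u0301";   // 'e' followed by a combining acute accent

        // Same rendered text, different code-unit sequences...
        Console.WriteLine(composed.Length);                    // 1
        Console.WriteLine(decomposed.Length);                  // 2
        Console.WriteLine(composed == decomposed);             // False (ordinal comparison)

        // ...until both are brought to the same normalization form (NFC by default).
        Console.WriteLine(composed == decomposed.Normalize()); // True
    }
}
```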




Answer 2:


As with many "why was this chosen" questions, this was determined by history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode still had a code space of only 65,536 codepoints, covered by the fixed-width 16-bit encoding now known as UCS-2. It wasn't until 1996 that Unicode acquired the supplementary planes, extending the code space to over a million codepoints, and surrogate pairs to fit them into a 16-bit code-unit encoding, which established the UTF-16 standard.

.NET strings are UTF-16 because that's an excellent fit with the operating system's encoding: no conversion is required.
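
As a rough, Windows-only illustration (not part of the original answer), P/Invoking a "wide" Win32 API with CharSet.Unicode lets the marshaller hand the string's UTF-16 buffer to the OS essentially as-is, with no re-encoding step:

```csharp
using System;
using System.Runtime.InteropServices;

class NativeInterop
{
    // MessageBoxW is the UTF-16 ("wide") variant of the Win32 API.
    // With CharSet.Unicode the .NET string is passed as an LPCWSTR,
    // i.e. a pointer to UTF-16 code units, matching the string's
    // in-memory representation.
    [DllImport("user32.dll", CharSet = CharSet.Unicode)]
    static extern int MessageBoxW(IntPtr hWnd, string text, string caption, uint type);

    static void Main()
    {
        MessageBoxW(IntPtr.Zero, "Hello, UTF-16 world", ".NET interop", 0);
    }
}
```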

The history of UTF-8 is murkier. It was designed in late 1992 and standardized later (the current specification, RFC 3629, dates from November 2003), definitely after Windows NT had committed to a 16-bit encoding. It took a while to gain a foothold; the Internet was instrumental.




Answer 3:


UTF-8 is the default for text storage and transfer because it is a relatively compact form for most languages (although some languages are more compact in UTF-16 than in UTF-8). Any specific language usually has an even more efficient dedicated encoding, but UTF-8 works for all of them.

UTF-16 is used for in-memory strings because it is faster to parse per character and maps directly to the Unicode character class and other tables. All string functions in Windows use UTF-16 and have for years.
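
A small sketch of that per-character mapping, together with the surrogate caveat from the first answer; note that the category reported for supplementary characters depends on the Unicode tables shipped with the runtime:

```csharp
using System;
using System.Globalization;

class CategoryLookup
{
    static void Main()
    {
        // For BMP characters, classification works directly on a single char.
        Console.WriteLine(char.GetUnicodeCategory('A'));    // UppercaseLetter
        Console.WriteLine(char.GetUnicodeCategory('你'));   // OtherLetter

        // A character outside the BMP is two chars, and a lone char is only
        // half a surrogate pair, so the lookup must be done at a string index.
        string emoji = "😀";
        Console.WriteLine(char.GetUnicodeCategory(emoji[0]));            // Surrogate
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(emoji, 0)); // OtherSymbol on current runtimes
    }
}
```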



Source: https://stackoverflow.com/questions/14942092/why-does-net-uses-the-utf16-encoding-for-string-but-uses-utf8-as-default-for
