Why does Java use UTF-16 for the internal text representation


Question


Java uses UTF-16 for its internal text representation. But why? UTF-8, it seems to me, is more flexible.

From wiki:

UTF-8 requires either 8, 16, 24 or 32 bits (one to four octets) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.
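
For concreteness, those byte counts are easy to observe from Java itself. The following is a minimal sketch (the sample characters are only illustrative) that encodes a few characters with both charsets and prints their sizes:

    import java.nio.charset.StandardCharsets;

    public class EncodingSizes {
        public static void main(String[] args) {
            // 'A' (ASCII), 'é' (Latin-1 range), '中' (CJK, BMP), U+1F600 (outside the BMP)
            String[] samples = { "A", "\u00E9", "\u4E2D", new String(Character.toChars(0x1F600)) };
            for (String s : samples) {
                System.out.printf("U+%05X : %d byte(s) in UTF-8, %d byte(s) in UTF-16%n",
                        s.codePointAt(0),
                        s.getBytes(StandardCharsets.UTF_8).length,
                        s.getBytes(StandardCharsets.UTF_16BE).length);  // UTF_16BE avoids the BOM
            }
        }
    }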


Answer 1:


Java was designed and first implemented back in the days when Unicode was specified as a set of 16-bit code points. That is why char is a 16-bit type, and why String is modeled as a sequence of char.
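
As a small illustration (standard library only): for characters inside the original 16-bit range, the Basic Multilingual Plane, one char really is one character:

    public class CharIsACodeUnit {
        public static void main(String[] args) {
            System.out.println(Character.SIZE);     // 16  -- a char is one 16-bit UTF-16 code unit
            String s = "Hi\u00E9";                  // "Hié" -- every character here is in the BMP
            System.out.println(s.length());         // 3   -- one char per character
            System.out.println((int) s.charAt(2));  // 233 -- the code point of 'é' (U+00E9)
        }
    }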

Now, if the Java designers had been able to foresee that Unicode would add extra "code planes", they might¹ have opted for a 32-bit char type.

Java 1.0 came out in January 1996. Unicode 2.0 (which introduced the higher code planes and the surrogate mechanism) was released in July 1996.
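
The consequence of that timing is visible today whenever a string contains a character from one of the supplementary planes: the String API still counts 16-bit char values, so one such code point occupies two of them (a surrogate pair). A short sketch, using only the standard library:

    public class SurrogatePairs {
        public static void main(String[] args) {
            // U+1F600 (GRINNING FACE) lives in a supplementary plane introduced by Unicode 2.0.
            String smiley = new String(Character.toChars(0x1F600));

            System.out.println(smiley.length());                            // 2 -- two UTF-16 code units
            System.out.println(smiley.codePointCount(0, smiley.length()));  // 1 -- but only one code point
            System.out.printf("%04X %04X%n",
                    (int) smiley.charAt(0),   // D83D -- high surrogate
                    (int) smiley.charAt(1));  // DE00 -- low surrogate
        }
    }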


Internally, I believe that some versions of Java have used UTF-8 as the representation for strings, at least at some level. However, it is still necessary to map that representation to the methods specified in the String API, because that is what Java applications require. Doing that is going to be inefficient if the primary internal representation is UTF-8 rather than UTF-16.
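
To see why, consider index-based access, which the String API promises. With an array of 16-bit code units, charAt(i) is a single array lookup; over UTF-8 bytes, even a hypothetical helper (the sketch below is not JDK code) has to scan from the start of the buffer, because each code point may occupy one to four bytes:

    import java.nio.charset.StandardCharsets;

    public class Utf8IndexingSketch {

        // Hypothetical helper, not part of the JDK: return the n-th code point of some UTF-8 bytes.
        // Unlike String.charAt(i), which is a constant-time array access, this must walk the bytes
        // from the beginning, because UTF-8 sequences are 1 to 4 bytes long.
        static int codePointAt(byte[] utf8, int n) {
            int i = 0;
            for (int seen = 0; seen < n; seen++) {
                int b = utf8[i] & 0xFF;
                if (b < 0x80)      i += 1;   // 1-byte sequence (ASCII)
                else if (b < 0xE0) i += 2;   // 2-byte sequence
                else if (b < 0xF0) i += 3;   // 3-byte sequence
                else               i += 4;   // 4-byte sequence
            }
            // Decode just the code point that starts at offset i (at most 4 bytes long).
            String one = new String(utf8, i, Math.min(4, utf8.length - i), StandardCharsets.UTF_8);
            return one.codePointAt(0);
        }

        public static void main(String[] args) {
            byte[] utf8 = "na\u00EFve \u4E2D\u6587".getBytes(StandardCharsets.UTF_8);  // "naïve 中文"
            System.out.printf("%c%n", codePointAt(utf8, 6));  // '中' -- reached only after scanning
        }
    }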

And before you suggest that they should "just change the String APIs" ... consider how many trillions of lines of Java code already exist that depend on the current String APIs.


For what it is worth, most if not all programming languages that support Unicode do it via a 16-bit char or wchar type.


¹ - ... and possibly not, bearing in mind that memory was a lot more expensive back then, and programmers worried much more about such things.



Source: https://stackoverflow.com/questions/33194496/why-does-java-use-utf-16-for-the-internal-text-representation
