Why does Java char use UTF-16?


Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical:

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.

This, and the birth of UTF-16, is further explained by the Unicode FAQ page:

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So, all things weighed up, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated by that stage. That left UTF-16 as the easiest natural progression beyond it.
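To make the consequence concrete, here is a minimal sketch (plain Java, only java.lang APIs) showing that char still holds a single 16-bit UTF-16 code unit: a BMP character fits in one char, while a supplementary character, the kind allocated after UCS-2 was chosen, occupies a surrogate pair of two chars:

```java
// char is one 16-bit UTF-16 code unit. BMP characters fit in a
// single char; supplementary characters need a surrogate pair.
public class Utf16Demo {
    public static void main(String[] args) {
        String bmp = "A";                       // U+0041, inside the BMP
        String supplementary = "\uD83D\uDE00";  // U+1F600, outside the BMP

        System.out.println(bmp.length());            // 1 -> one code unit
        System.out.println(supplementary.length());  // 2 -> a surrogate pair
        // Still a single code point, despite length() == 2:
        System.out.println(
            supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```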

One reason is the performance characteristics of random access and of iterating over the characters of a String:

UTF-8 encoding uses a variable number (1–4) of bytes to encode a Unicode character. Therefore, accessing a character by index, as String.charAt(i) does, would be considerably more complicated to implement and slower than the simple array access used by java.lang.String.
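A short sketch of that trade-off in practice: charAt(i) is a constant-time code-unit lookup, which can land in the middle of a surrogate pair, whereas code-point-aware access (codePointAt, codePoints) has to reassemble the pair:

```java
// charAt(i) is an O(1) array lookup of a 16-bit code unit; under a
// variable-width encoding such as UTF-8 the same operation would
// require scanning from the start of the string.
public class CharAtDemo {
    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (surrogate pair), 'b'

        // Constant-time access: index 1 lands on the high surrogate.
        char unit = s.charAt(1);
        System.out.printf("charAt(1) = U+%04X%n", (int) unit); // U+D83D

        // Code-point-aware access reassembles the pair.
        System.out.printf("codePointAt(1) = U+%X%n", s.codePointAt(1)); // U+1F600

        // Iterating by code point rather than by code unit (Java 8+):
        s.codePoints().forEach(cp -> System.out.printf("U+%X ", cp));
    }
}
```

Note that charAt stays O(1) precisely because it returns a code unit, not a code point; correctness for supplementary characters is pushed onto the code-point APIs shown above.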
