Why does Java char use UTF-16?


Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical:

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.

This, and the birth of UTF-16, is further explained by the Unicode FAQ page:

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So, all things weighed up, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated by that stage. That left UTF-16 as the easiest natural progression beyond it.
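To make the consequence concrete, here is a minimal sketch (plain Java, only java.lang APIs) showing that char still holds a single 16-bit UTF-16 code unit: a BMP character fits in one char, while a supplementary character, the kind allocated after UCS-2 was chosen, occupies a surrogate pair of two chars:

```java
// char is one 16-bit UTF-16 code unit. BMP characters fit in a
// single char; supplementary characters need a surrogate pair.
public class Utf16Demo {
    public static void main(String[] args) {
        String bmp = "A";                       // U+0041, inside the BMP
        String supplementary = "\uD83D\uDE00";  // U+1F600, outside the BMP

        System.out.println(bmp.length());            // 1 -> one code unit
        System.out.println(supplementary.length());  // 2 -> a surrogate pair
        // Still a single code point, despite length() == 2:
        System.out.println(
            supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```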

One reason is the performance characteristics of random access and of iterating over the characters of a String:

UTF-8 encoding uses a variable number (1–4) of bytes to encode a Unicode character. Therefore, accessing a character by index, as String.charAt(i) does, would be considerably more complicated to implement and slower than the simple array access used by java.lang.String.
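A short sketch of that trade-off in practice: charAt(i) is a constant-time code-unit lookup, which can land in the middle of a surrogate pair, whereas code-point-aware access (codePointAt, codePoints) has to reassemble the pair:

```java
// charAt(i) is an O(1) array lookup of a 16-bit code unit; under a
// variable-width encoding such as UTF-8 the same operation would
// require scanning from the start of the string.
public class CharAtDemo {
    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (surrogate pair), 'b'

        // Constant-time access: index 1 lands on the high surrogate.
        char unit = s.charAt(1);
        System.out.printf("charAt(1) = U+%04X%n", (int) unit); // U+D83D

        // Code-point-aware access reassembles the pair.
        System.out.printf("codePointAt(1) = U+%X%n", s.codePointAt(1)); // U+1F600

        // Iterating by code point rather than by code unit (Java 8+):
        s.codePoints().forEach(cp -> System.out.printf("U+%X ", cp));
    }
}
```

Note that charAt stays O(1) precisely because it returns a code unit, not a code point; correctness for supplementary characters is pushed onto the code-point APIs shown above.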
