Question
I'm finding Java's differentiation of char and codepoint to be strange and out of place.
For example, a string is an array of characters, or "letters which appear in an alphabet"; a codepoint, in contrast, MAY be a single letter, or possibly a composite or surrogate pair. However, Java defines a character of a string as a char, which cannot be composite or contain a surrogate, and the codepoint as an int (this is fine).
But then length() seems to return the number of codepoints, while codePointCount() also returns the number of codepoints but instead combines composite characters... which ends up not really being the real count of codepoints?
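To make the discrepancy concrete, here is a small sketch (nothing beyond standard java.lang.String methods) showing the two counts disagreeing on a string that contains a character outside the BMP:

    public class LengthVsCodePointCount {
        public static void main(String[] args) {
            // "a" followed by U+1F600, which needs a surrogate pair (two chars) in UTF-16
            String s = "a\uD83D\uDE00";

            System.out.println(s.length());                      // 3 -- one per UTF-16 char
            System.out.println(s.codePointCount(0, s.length())); // 2 -- the surrogate pair counts as one codepoint
        }
    }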
It feels as though charAt() should return a String so that composites and surrogates are brought along, and the result of length() should swap with codePointCount().
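Something like the following sketch is what I would expect charAt() to do for me; codePointAsString() is just a hypothetical helper name, built on the real offsetByCodePoints(), codePointAt() and Character.charCount() methods:

    public class CodePointAsString {
        // Hypothetical charAt()-style accessor that returns the i-th codepoint as a String,
        // so a surrogate pair stays intact; i counts codepoints, not chars.
        static String codePointAsString(String s, int i) {
            int start = s.offsetByCodePoints(0, i);                       // codepoint index -> char index
            int end = start + Character.charCount(s.codePointAt(start));
            return s.substring(start, end);
        }

        public static void main(String[] args) {
            String s = "a\uD83D\uDE00b";
            System.out.println(codePointAsString(s, 1)); // prints the U+1F600 character as a 2-char String
        }
    }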
The original implementation feels a little backwards. Is there a reason it's designed the way it is?
Update: codePointAt(), codePointBefore()
It's also worth noting that codePointAt() and codePointBefore() accept an index as a parameter; however, the index acts upon chars and has a range of 0 to length() - 1, and is therefore not based on the number of codepoints in the string, as one might assume.
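A quick sketch of what I mean (all standard String/Character methods):

    public class CodePointIndexing {
        public static void main(String[] args) {
            String s = "\uD83D\uDE00b"; // U+1F600 followed by 'b'

            // The index parameter is a char index, not a codepoint index:
            System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600 -- the whole pair
            System.out.println(Integer.toHexString(s.codePointAt(1))); // de00  -- a lone low surrogate

            // Indexing by codepoint requires translating the index first:
            int charIndex = s.offsetByCodePoints(0, 1);                 // 2
            System.out.println((char) s.codePointAt(charIndex));        // b
        }
    }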
Update: equalsIgnoreCase()
String.equalsIgnoreCase() uses the term normalization to describe what it does prior to comparing strings. This is a misnomer, as normalization in the context of a Unicode string can mean something entirely different. What they mean to say is that they use case-folding.
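For example (using java.text.Normalizer only to show what actual normalization would look like; equalsIgnoreCase() itself never does this):

    import java.text.Normalizer;

    public class CaseFoldVsNormalize {
        public static void main(String[] args) {
            String precomposed = "\u00E9";  // é as one codepoint
            String decomposed  = "e\u0301"; // e + combining acute accent

            // equalsIgnoreCase() only case-folds char by char; it does not normalize,
            // so the two spellings of é still compare unequal:
            System.out.println(precomposed.equalsIgnoreCase(decomposed)); // false

            // Real normalization is a separate step:
            String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(precomposed.equalsIgnoreCase(nfc));        // true
        }
    }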
Answer 1:
When Java was created, Unicode didn't have the notion of surrogate characters, and Java decided to represent characters as 16-bit values.
I suppose they didn't want to break backwards compatibility. There is a lot more information here: http://www.oracle.com/us/technologies/java/supplementary-142654.html
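As a rough sketch of the consequence (assuming a Java 8+ runtime for codePoints()): a character outside the original 16-bit range is stored as two chars, and the codepoint-oriented methods added in Java 5 put them back together:

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1F600 lies outside the 16-bit range char was designed for,
            // so the String stores it as a surrogate pair of two chars.
            String s = new String(Character.toChars(0x1F600));

            System.out.println(s.length());                             // 2
            System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
            System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true

            // The codepoint-oriented methods reassemble the pair:
            s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp))); // 1f600
        }
    }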
Source: https://stackoverflow.com/questions/34984271/clarifying-javas-evolutionary-support-of-unicode