Java charAt used with characters that have two code units

Posted by 假如想象 on 2019-11-27 22:57:56

It sounds like the book is saying that 'ℤ' is not in the Basic Multilingual Plane (BMP), but in fact it is.

Java uses UTF-16, with surrogate pairs for characters that are not in the Basic Multilingual Plane. Since 'ℤ' (U+2124) is in the Basic Multilingual Plane, it is represented by a single code unit. In your example, sentence.charAt(0) will return 'ℤ', and sentence.charAt(1) will return ' '.

A character represented by a surrogate pair is made up of two code units. If the sentence started with such a character, sentence.charAt(0) would return the first code unit (the high surrogate), and sentence.charAt(1) would return the second code unit (the low surrogate).
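To make the difference concrete, here is a minimal sketch comparing a BMP character with a supplementary one. The class name SurrogateDemo, the describe helper, and the sample strings are made up for illustration; only the Character and String methods are standard API.

```java
public class SurrogateDemo {
    // Classifies the UTF-16 code unit found at the given char index of s.
    public static String describe(String s, int index) {
        char unit = s.charAt(index);
        if (Character.isHighSurrogate(unit)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(unit)) {
            return "low surrogate";
        }
        return "BMP character";
    }

    public static void main(String[] args) {
        // U+2124 is in the BMP, so it occupies a single char.
        String bmp = "\u2124 is the set of integers";
        // U+1F602 is outside the BMP, so it is stored as a surrogate pair.
        String supplementary = "\uD83D\uDE02 is an emoji";

        System.out.println(describe(bmp, 0));           // BMP character
        System.out.println(describe(supplementary, 0)); // high surrogate
        System.out.println(describe(supplementary, 1)); // low surrogate
    }
}
```

Neither surrogate is a printable character on its own, which is why charAt on a supplementary character is rarely what you want.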

See http://docs.oracle.com/javase/6/docs/api/java/lang/String.html:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

Jason Sperske

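According to the documentation, a String is represented as UTF-16, so charAt() gives you individual UTF-16 code units, not code points. A character outside the basic multilingual plane therefore shows up as two separate char values. If you are interested in seeing the individual code points you can use this code (from this answer):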

final int length = sentence.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = sentence.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}
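On Java 8 or later, the same traversal can probably be written more compactly with String.codePoints(). This is only a sketch: the class name CodePointList and the helper codePointsOf are hypothetical, and the "U+XXXX" formatting is just one way to display the result.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CodePointList {
    // Returns every code point of s formatted as "U+XXXX" (hypothetical helper).
    public static List<String> codePointsOf(String s) {
        return s.codePoints()                                  // IntStream of code points
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The surrogate pair is reported as one code point, U+1F602.
        System.out.println(codePointsOf("\uD83D\uDE02 ok"));
        // prints [U+1F602, U+0020, U+006F, U+006B]
    }
}
```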

The Javadocs explain this:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

In short, the book is wrong.

Edit to add from comments below: Something I didn't think of last night is that the character you used in your question isn't actually the one they're talking about. What they're really getting at is the case where a character requires four bytes rather than two. The paragraph above in the Javadoc links to another Javadoc section, Unicode Character Representations, which talks about the ramifications of this.

Horstmann was talking about the 'Z' which needs two UTF-16 code units. Take a look at this code:

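public class Main {
    public static void main(String[] args) {
        // The emoji U+1F602 is written here as its surrogate pair.
        String a = "\uD83D\uDE02 is String";
        System.out.println("Length: " + a.length()); // 12: the emoji counts as 2 chars
        System.out.println(a.charAt(0)); // high surrogate, not printable on its own
        System.out.println(a.charAt(1)); // low surrogate, not printable on its own
        System.out.println(a.charAt(2)); // ' '
        System.out.println(a.charAt(3)); // 'i'
    }
}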

In IntelliJ IDEA I can't even paste the 4-byte character as a single character: when I paste the emoji 😂, the IDE automatically converts it to "\uD83D\uDE02". Notice that this emoji is counted as 2 chars, so length() reports 12 for the string above.

If you want to count the 'real length' (the number of code points), you should use: System.out.println("Real length: " + a.codePointCount(0, a.length()));
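Along the same lines, if you want the n-th 'real' character rather than the n-th char, one option is to combine offsetByCodePoints with codePointAt. This is a sketch only; the class name NthCodePoint and the helper codePointAtPosition are made up for illustration.

```java
public class NthCodePoint {
    // Returns the code point at the given code point position (not char index).
    public static int codePointAtPosition(String s, int position) {
        // Translate the code point position into a char index first.
        int charIndex = s.offsetByCodePoints(0, position);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String a = "\uD83D\uDE02 is String";
        int first = codePointAtPosition(a, 0);
        int second = codePointAtPosition(a, 1);
        System.out.println(Integer.toHexString(first)); // 1f602, the whole emoji
        System.out.println((char) second);              // the space after the emoji
    }
}
```

Note that offsetByCodePoints has to scan the string from the start, so repeated lookups in a loop are better done with the codePointAt / Character.charCount traversal.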

Take a look at: What are the most common non-BMP Unicode characters in actual use?
