encodings - different result between codePointCount and length

Submitted by 倾然丶 夕夏残阳落幕 on 2021-02-08 04:44:49

Question


I ran into a tricky case and couldn't find an explanation of why this happens.

The main question is how long the string is: does it contain one character or two?

Code:

public class App {
    public static void main(String[] args) throws Exception {
        char ch0 = 55378; // 0xD852
        char ch1 = 56816; // 0xDDF0
        String str = new String(new char[]{ch0, ch1});
        System.out.println(str);
        System.out.println(str.length());
        System.out.println(str.codePointCount(0, 2));
        System.out.println(str.charAt(0));
        System.out.println(str.charAt(1));
    }
}

Output:

?
2
1
?
?

Any suggestions?


Answer 1:


Does it contain one character or two?

It contains one Unicode character, which is made up of 2 UTF-16 code units. Every char in Java is a UTF-16 code unit; it may not be a whole character. Each character has a single code point: Unicode defines a coded character set that maps each character to an integer representing that character (the code point).
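To make that concrete, here is a minimal sketch (the class name SurrogateDemo and the comments are mine, not part of the original answer) that uses the standard Character methods to confirm the two chars form a surrogate pair and to combine them into a single code point:

public class SurrogateDemo {
    public static void main(String[] args) {
        char high = 55378; // 0xD852, a UTF-16 high (leading) surrogate
        char low  = 56816; // 0xDDF0, a UTF-16 low (trailing) surrogate

        // The two code units together encode exactly one code point
        System.out.println(Character.isSurrogatePair(high, low)); // true

        // Combine the pair into the supplementary code point it represents
        int codePoint = Character.toCodePoint(high, low);
        System.out.printf("U+%X%n", codePoint); // U+249F0

        // A supplementary code point needs two chars in UTF-16
        System.out.println(Character.charCount(codePoint)); // 2
    }
}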

length() returns the number of code units, whereas codePointCount returns the number of code points.
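If you need to process the string character by character, iterate by code point rather than by char. A minimal sketch (the class name CodePointDemo is just for illustration):

public class CodePointDemo {
    public static void main(String[] args) {
        String str = new String(new char[]{(char) 55378, (char) 56816});

        System.out.println(str.length());                        // 2 code units
        System.out.println(str.codePointCount(0, str.length())); // 1 code point

        // Walk the string one code point at a time
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            System.out.printf("U+%X%n", cp);
            i += Character.charCount(cp); // advances past both surrogates here
        }

        // On Java 8+, codePoints() gives the same values as a stream
        str.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
    }
}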

You may want to look at my article about encodings in .NET - the terminology all translates fine (as it's standard terminology), so just ignore the .NET-specific parts.



Source: https://stackoverflow.com/questions/20162239/encodings-different-result-between-codepointcount-and-length
