If I use Java 8's String.codePoints to get an array of int codePoints, is it true that the length of the array is the count of characters?

六月ゝ 毕业季﹏ submitted on 2019-12-19 04:55:17

Question


Given a String string in Java, does string.codePoints().toArray().length reflect the length of the String in terms of the actual characters that a human would find meaningful? In other words, does it smooth over escape characters and other artifacts of encoding?

Edit: By "human" I kind of meant "programmer", as I would imagine most programmers would see \r\n as two characters, ESC as one character, etc. But now I see that even accent marks get atomized, so it doesn't matter.


Answer 1:


No.

For example:

  • Control characters (such as ESC, CR, NL, and so on) will not be removed. These have distinct code points in Unicode.

  • Sequences of spaces, tabs, etc. are not combined.

  • Discretionary (soft) hyphen characters (http://www.fileformat.info/info/unicode/char/00AD/index.htm) are not removed.

  • Unicode combining characters (https://en.wikipedia.org/wiki/Combining_character) are not combined.


Now it is debatable whether some of these might be "actual characters that a human would find meaningful" ... but the overall answer is still No.
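The cases in the list above can be checked with a minimal sketch. The class name CodePointCountDemo and the helper codePointCount are made up for illustration; only String.codePoints() and IntStream.toArray() are standard API.

```java
public class CodePointCountDemo {
    // Hypothetical helper: the value the question asks about,
    // i.e. string.codePoints().toArray().length.
    static int codePointCount(String s) {
        return s.codePoints().toArray().length;
    }

    public static void main(String[] args) {
        // CR and LF are two separate control code points.
        System.out.println(codePointCount("\r\n"));       // 2
        // ESC is a single control code point and is not dropped.
        System.out.println(codePointCount("\u001B"));     // 1
        // Two spaces and a tab are counted individually, not merged.
        System.out.println(codePointCount("a  \tb"));     // 5
        // The soft hyphen (U+00AD) is counted even though it is usually invisible.
        System.out.println(codePointCount("co\u00ADop")); // 5
        // 'e' followed by a combining acute accent (U+0301) is two code points.
        System.out.println(codePointCount("e\u0301"));    // 2
    }
}
```

In every case the count is the number of code points, regardless of how the text would be displayed.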


You clarified as follows:

By "human" I kind of meant "programmer" as I would imagine most programmers would see \r\n as two characters ...

It is more complicated than that. I am a programmer, and for me it depends on the context whether \r\n are meaningful or not. If I am reading a README file, my brain will treat differences in white space as having no semantic importance. But if I am writing a parser, my code would take whitespace into account ... depending on the language it is intended to parse.




Answer 2:


Just check the Javadoc of the codePoints() method in CharSequence:

Returns a stream of code point values from this sequence. Any surrogate pairs encountered in the sequence are combined as if by Character.toCodePoint and the result is passed to the stream. Any other code units, including ordinary BMP characters, unpaired surrogates, and undefined code units, are zero-extended to int values which are then passed to the stream. (https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints--)
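The behaviour described in that Javadoc excerpt can be illustrated with a short sketch. The class name SurrogateDemo and the helper codePointsOf are illustrative assumptions; U+1F600 is chosen only as an example of a character outside the BMP.

```java
public class SurrogateDemo {
    // Hypothetical helper wrapping the call discussed in the question.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as a surrogate pair, so it takes two chars.
        String paired = "\uD83D\uDE00";
        System.out.println(paired.length());              // 2
        System.out.println(codePointsOf(paired).length);  // 1
        // The pair is combined as if by Character.toCodePoint.
        System.out.println(codePointsOf(paired)[0]
                == Character.toCodePoint('\uD83D', '\uDE00')); // true

        // An unpaired high surrogate is zero-extended and passed through as is.
        String lone = "a\uD83D";
        System.out.println(codePointsOf(lone).length);    // 2
        System.out.println(codePointsOf(lone)[1] == 0xD83D); // true
    }
}
```

So the array length can be smaller than String.length(), but only because surrogate pairs are merged; nothing else is smoothed over.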

Also check the code-point-related members of the String class to understand what a code point is:

String(int[] codePoints, int offset, int count): Allocates a new String that contains characters from a subarray of the Unicode code point array argument. (https://docs.oracle.com/javase/8/docs/api/java/lang/String.html)

In Java, a code point is an int holding a Unicode code point value (https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#unicode), so every character is included, even those that are not human-readable.
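As a rough illustration of the String(int[] codePoints, int offset, int count) constructor quoted above, the following sketch builds a string from code points and reads them back. The class name CodePointRoundTrip and the helper fromCodePoints are hypothetical names, and the sample values are arbitrary.

```java
import java.util.Arrays;

public class CodePointRoundTrip {
    // Hypothetical helper: build a String from a whole code point array.
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // 'H', 'i', and U+1F600 (outside the BMP, needs a surrogate pair).
        int[] input = {0x48, 0x69, 0x1F600};
        String s = fromCodePoints(input);

        // Three code points become four chars, because the last one needs two.
        System.out.println(s.length());                        // 4
        // Reading the code points back gives the original array.
        int[] output = s.codePoints().toArray();
        System.out.println(output.length);                     // 3
        System.out.println(Arrays.equals(input, output));      // true
    }
}
```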




Answer 3:


In Java 8, calling codePoints() on a String object returns a stream of code points (an IntStream). Calling toArray() on that stream treats each code point separately, so the length of the resulting array is the number of code points in the string.



Source: https://stackoverflow.com/questions/39123371/if-i-use-java-8s-string-codepoints-to-get-an-array-of-int-codepoints-is-it-tru
