What does the .NET String.Length property return? Surrogate-neutral length or complete character length?

悲哀的现实 2020-12-14 01:56

The documentation and its wording vary between VS 2008 and VS 2010:


VS 2008 Documentation

Internally, the text is stored as a read

3 Answers
  • 2020-12-14 02:17

    String.Length does not account for surrogate pairs; it only counts UTF-16 code units (i.e. each char is always 2 bytes), so a surrogate pair is counted as 2 chars.
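
    As a rough illustration of the point above, the following sketch compares a BMP-only string with one that contains a character outside the BMP. The class name `LengthDemo`, the helper `CodeUnits`, and the sample strings are made up for this example; only `String.Length`, `char.IsHighSurrogate`, and `char.IsLowSurrogate` are real .NET members.

    ```csharp
    using System;

    public static class LengthDemo
    {
        // Hypothetical helper: simply exposes String.Length,
        // i.e. the number of UTF-16 code units in the string.
        public static int CodeUnits(string s) => s.Length;

        public static void Main()
        {
            string bmp = "abc";              // three BMP characters
            string astral = "a\U0001D11E";   // 'a' + U+1D11E (MUSICAL SYMBOL G CLEF)

            Console.WriteLine(CodeUnits(bmp));     // 3
            Console.WriteLine(CodeUnits(astral));  // 3: one for 'a', two for the surrogate pair

            // The two trailing chars are the halves of the surrogate pair.
            Console.WriteLine(char.IsHighSurrogate(astral[1])); // True
            Console.WriteLine(char.IsLowSurrogate(astral[2]));  // True
        }
    }
    ```

    So the second string holds two code points but reports a Length of 3.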

  • 2020-12-14 02:26

    I would consider both false. The second would be true if you had asked about the count of Unicode code points, but you asked about "length". A String's Length is the count of its elements, which are 16-bit words. If the string contains only code points from the BMP (Basic Multilingual Plane), the length equals the number of Unicode characters/code points. If it contains code points from beyond the BMP, or orphaned surrogates (high or low surrogates that do not appear as an ordered pair), the length is NOT equal to the number of characters/code points.

    First of all, a String is a sequence of words: a word list, word array, or word stream. Its contents are 16-bit words and that's it. Naming an element "char" or "wchar" is a sin with regard to Unicode characters. A Unicode character can have a code point greater than 0xFFFF, so it cannot be stored in a type that is 16 bits wide; calling that type char or wchar makes it worse, because it can only ever hold code points up to 0xFFFF, which corresponds to the Unicode 1.0 standard, nowadays about 20 years old. To store even the highest possible Unicode code point in a single data type, the type needs 21 bits; there is no such type, so we would use a 32-bit type. In fact there is a static method (of the Char class!) named ConvertToUtf32() that does just this: it can return a low ASCII code point or even the highest Unicode code point, which implies that the method can detect a surrogate pair at a given position in a String.
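
    A minimal sketch of that idea: `char.ConvertToUtf32(string, int)` reads a whole code point at an index, and `char.IsSurrogatePair(string, int)` can be used to count code points rather than 16-bit words. The class `CodePointDemo` and the method `CountCodePoints` are hypothetical names for this example, and counting an orphaned surrogate as one element is an assumption of the sketch, not something the framework prescribes.

    ```csharp
    using System;

    public static class CodePointDemo
    {
        // Hypothetical helper: counts code points by walking the 16-bit words.
        // A well-formed surrogate pair counts once; an orphaned surrogate is
        // counted as one element (an assumption made for this sketch).
        public static int CountCodePoints(string s)
        {
            int count = 0;
            for (int i = 0; i < s.Length; i++)
            {
                if (char.IsSurrogatePair(s, i))
                {
                    i++; // skip the low surrogate of the pair
                }
                count++;
            }
            return count;
        }

        public static void Main()
        {
            string s = "a\U0001D11E"; // 'a' + U+1D11E, stored as three 16-bit words

            Console.WriteLine(s.Length);            // 3
            Console.WriteLine(CountCodePoints(s));  // 2

            // ConvertToUtf32 combines the surrogate pair at index 1 into one 32-bit value.
            int codePoint = char.ConvertToUtf32(s, 1);
            Console.WriteLine(codePoint.ToString("X")); // 1D11E
        }
    }
    ```

    Note that `char.ConvertToUtf32(string, int)` throws an ArgumentException when the index points at an orphaned surrogate, so real code would need to guard against malformed input.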

  • 2020-12-14 02:27

    String.Length does not account for surrogate pairs; however, the StringInfo.LengthInTextElements property does.

    StringInfo.SubstringByTextElements is similar to String.Substring, but it operates on "text elements", such as surrogate pairs and combining character sequences, as well as normal characters. The functionality of both of these members is based on the StringInfo.ParseCombiningCharacters method, which extracts the starting index of each text element and stores the indexes in a private array.

    "The .NET Framework defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence." - http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
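
    The members mentioned above can be sketched as follows. The class name `TextElementDemo`, the helper `TextElements`, and the sample strings are invented for illustration; the `StringInfo` members are the real ones from System.Globalization.

    ```csharp
    using System;
    using System.Globalization;

    public static class TextElementDemo
    {
        // Hypothetical helper: number of text elements (graphemes) in s.
        public static int TextElements(string s) => new StringInfo(s).LengthInTextElements;

        public static void Main()
        {
            string clef = "a\U0001D11Eb"; // 'a', U+1D11E (surrogate pair), 'b'
            string accented = "e\u0301";  // 'e' followed by COMBINING ACUTE ACCENT

            Console.WriteLine(clef.Length);            // 4
            Console.WriteLine(TextElements(clef));     // 3
            Console.WriteLine(accented.Length);        // 2
            Console.WriteLine(TextElements(accented)); // 1

            // Take the second text element; the result still spans two chars.
            string middle = new StringInfo(clef).SubstringByTextElements(1, 1);
            Console.WriteLine(middle.Length);          // 2

            // Starting index of each text element within the string.
            int[] starts = StringInfo.ParseCombiningCharacters(clef);
            Console.WriteLine(string.Join(",", starts)); // 0,1,3
        }
    }
    ```

    The samples stick to a single surrogate pair and a simple combining sequence; how more complex clusters are segmented may differ between .NET versions.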
