UTF-16 safe substring in C# .NET

Anonymous (unverified), submitted on 2019-12-03 01:33:01

Question:

I want to get a substring of a given length, say 150. However, I want to make sure I don't cut the string off in the middle of a Unicode character.

e.g. see the following code:

Here substr is an invalid string since the smiley character is cut in half.
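The snippet the question refers to is not preserved here. As a rough illustration only, the sketch below reproduces the kind of problem described, assuming a sample string "Hello😀 world!" and a cut length of 6 (both are assumptions, not taken from the original post). HasLoneSurrogate is a hypothetical helper added just to make the damage visible.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: returns true if the string contains a surrogate
    // code unit that is not part of a well-formed high/low pair.
    public static bool HasLoneSurrogate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]))
            {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 >= s.Length || !char.IsLowSurrogate(s[i + 1]))
                {
                    return true;
                }
                i++; // skip the low half of a well-formed pair
            }
            else if (char.IsLowSurrogate(s[i]))
            {
                return true; // low surrogate with no preceding high surrogate
            }
        }
        return false;
    }

    // Assumed sample input: U+1F600 occupies two UTF-16 code units,
    // at indexes 5 and 6 of this string.
    public static string CutInHalf()
    {
        string input = "Hello\U0001F600 world!";
        return input.Substring(0, 6); // ends with an unpaired high surrogate
    }
}
```

With these assumptions, SurrogateDemo.HasLoneSurrogate(SurrogateDemo.CutInHalf()) is true, because string.Substring counts UTF-16 code units and knows nothing about surrogate pairs.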

Instead, I want a function that does as follows:

where substr is a valid string, with no character cut in half.

For reference, here is how I would do it in Objective-C using rangeOfComposedCharacterSequencesForRange

What is the equivalent code in C#?

Answer 1:

This should return the maximal substring that starts at index startIndex, is at most length characters long, and consists only of "complete" graphemes. Surrogate pairs split at the start or end of the range are removed, leading combining marks are removed, and trailing characters that would lose their combining marks are removed.

Note that this probably isn't what you asked for. You seem to want to use graphemes as the unit of measure (or perhaps you want to include the last grapheme even if that makes the result longer than the length parameter).

using System;
using System.Globalization;
using System.Text;

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder(length);

        // Exclusive end of the requested range, in UTF-16 code units
        int end = startIndex + length;

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            string grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;

            // Stop if this grapheme would run past the end of the range
            // (compare against end, not length, so startIndex > 0 works)
            if (startIndex > end)
            {
                break;
            }

            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            sb.Append(grapheme);

            if (startIndex == end)
            {
                break;
            }
        }

        return sb.ToString();
    }
}

Variant that simply includes "extra" characters at the end of the substring when they are needed to complete a grapheme:

using System;
using System.Globalization;
using System.Text;

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder(length);

        // Exclusive end of the requested range, in UTF-16 code units
        int end = startIndex + length;

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            // Stop once the range is covered; a grapheme that started inside
            // the range has already been appended whole
            // (compare against end, not length, so startIndex > 0 works)
            if (startIndex >= end)
            {
                break;
            }

            string grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;

            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            sb.Append(grapheme);
        }

        return sb.ToString();
    }
}

This variant returns what you asked for.
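Both versions above are built on StringInfo.GetTextElementEnumerator. As a minimal sketch of what that enumerator yields, the hypothetical helper below (SplitIntoTextElements is not part of the answer) collects the text elements of a string into a list. The sample input in the note is an assumption chosen to show a base letter with a combining mark and a surrogate pair.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: returns the text elements (graphemes) of s in order.
    public static List<string> SplitIntoTextElements(string s)
    {
        var result = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);

        while (enumerator.MoveNext())
        {
            // Each element is a whole grapheme: a base character together with
            // its combining marks, or a complete surrogate pair.
            result.Add(enumerator.GetTextElement());
        }

        return result;
    }
}
```

For the input "e\u0301\U0001F600x" (five UTF-16 code units) this yields three elements: "e" plus its combining acute accent, the emoji, and "x". Exactly how longer sequences such as multi-part emoji are grouped may differ between .NET versions, so treat the grouping as runtime-dependent.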



Answer 2:

Looks like you're looking to split a string on graphemes, that is on single displayed characters.

In that case, you have a handy method: StringInfo.SubstringByTextElements:
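The example that accompanied this answer is not preserved here. The following is a minimal sketch of how StringInfo.SubstringByTextElements might be used, not the answerer's original code. The helper name Take and its clamping behaviour are assumptions added for illustration; the clamping is there because SubstringByTextElements throws when the requested range is out of bounds.

```csharp
using System;
using System.Globalization;

public static class GraphemeSubstring
{
    // Hypothetical helper: returns up to count text elements of s,
    // starting at text element index start (not a char index).
    public static string Take(string s, int start, int count)
    {
        var info = new StringInfo(s);
        int available = info.LengthInTextElements - start;

        // Nothing to return: avoid the out-of-range exception.
        if (available <= 0 || count <= 0)
        {
            return string.Empty;
        }

        return info.SubstringByTextElements(start, Math.Min(count, available));
    }
}
```

Note that both arguments count text elements rather than UTF-16 code units, so asking for 6 elements of "Hello😀 world!" returns "Hello😀", which is 7 chars long. If a hard limit on the number of chars matters, this method alone does not enforce it.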



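Answer 3:

Here is a simple implementation for truncation (startIndex = 0):

// If the char at maxLength is a low surrogate, cutting there would split a
// surrogate pair, so cut one char earlier to drop the pair entirely.
string truncatedStr = (str.Length > maxLength)
    ? str.Substring(0, maxLength - (char.IsLowSurrogate(str[maxLength]) ? 1 : 0))
    : str;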
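The one-liner above can be wrapped in a method with a few guards. This is only a sketch under stated assumptions: the name TruncateSafe, the argument checks, and the extra test for a preceding high surrogate are additions for illustration, not part of the answer.

```csharp
using System;

public static class TruncateDemo
{
    // Hypothetical helper: truncates str to at most maxLength UTF-16 code
    // units without leaving half of a surrogate pair at the end.
    public static string TruncateSafe(string str, int maxLength)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }

        if (maxLength < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxLength));
        }

        if (str.Length <= maxLength)
        {
            return str;
        }

        int cut = maxLength;

        // The cut falls between the two halves of a pair: back up by one so
        // the whole pair is dropped.
        if (cut > 0 && char.IsLowSurrogate(str[cut]) && char.IsHighSurrogate(str[cut - 1]))
        {
            cut--;
        }

        return str.Substring(0, cut);
    }
}
```

Like the one-liner, this only protects surrogate pairs. A base letter followed by a combining mark, such as "e\u0301", can still be separated, which is the case the grapheme-based answers address.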

