I want to get a substring of a given length, say 150. However, I want to make sure I don't cut the string off in the middle of a Unicode character.
e.g. see the following code:
Here substr is an invalid string, since the smiley character is cut in half.
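A minimal sketch of the problem, for anyone who wants to reproduce it. The sample string and the SplitDemo class are assumptions made for illustration, not the code from the question:

```csharp
using System;

public static class SplitDemo
{
    // "Hello" + U+1F600 (grinning face, stored as a surrogate pair) + " world"
    public const string Sample = "Hello\uD83D\uDE00 world";

    // Plain Substring counts UTF-16 chars, so a length of 6 ends right after
    // the high surrogate and leaves the low surrogate behind.
    public static string NaiveCut()
    {
        return Sample.Substring(0, 6);
    }

    // True when the last char of s is a high surrogate with nothing after it.
    public static bool EndsWithDanglingHighSurrogate(string s)
    {
        return s.Length > 0 && char.IsHighSurrogate(s[s.Length - 1]);
    }
}
```

Calling SplitDemo.EndsWithDanglingHighSurrogate(SplitDemo.NaiveCut()) reports that the cut landed inside the emoji.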
Instead I want a function that does as follows:
where substr
For reference, here is how I would do it in Objective-C using rangeOfComposedCharacterSequencesForRange
What is the equivalent code in C#?
This should return the maximal substring, starting at index startIndex and with a length of up to length chars, made up of "complete" graphemes. So initial/final "split" surrogate pairs will be removed, initial combining marks will be removed, and final characters missing their combining marks will be removed.
Note that this probably isn't what you asked for. You seem to want to use graphemes as the unit of measure (or perhaps you want to include the last grapheme even if doing so goes over the length parameter).
    using System;
    using System.Globalization;
    using System.Text;

    public static class StringEx
    {
        public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
        {
            if (str == null)
            {
                throw new ArgumentNullException("str");
            }

            if (startIndex < 0 || startIndex > str.Length)
            {
                throw new ArgumentOutOfRangeException("startIndex");
            }

            if (length < 0)
            {
                throw new ArgumentOutOfRangeException("length");
            }

            if (startIndex + length > str.Length)
            {
                throw new ArgumentOutOfRangeException("length");
            }

            if (length == 0)
            {
                return string.Empty;
            }

            var sb = new StringBuilder(length);

            // Exclusive end index (in chars) of the requested range
            int end = startIndex + length;

            var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

            while (enumerator.MoveNext())
            {
                string grapheme = enumerator.GetTextElement();
                startIndex += grapheme.Length;

                // Stop if this grapheme would extend past the requested range
                if (startIndex > end)
                {
                    break;
                }

                // Skip initial Low Surrogates/Combining Marks
                if (sb.Length == 0)
                {
                    if (char.IsLowSurrogate(grapheme[0]))
                    {
                        continue;
                    }

                    UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                    if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                    {
                        continue;
                    }
                }

                sb.Append(grapheme);

                if (startIndex == end)
                {
                    break;
                }
            }

            return sb.ToString();
        }
    }
A variant that simply includes "extra" characters at the end of the substring if they are needed to complete a grapheme:
    using System;
    using System.Globalization;
    using System.Text;

    public static class StringEx
    {
        public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
        {
            if (str == null)
            {
                throw new ArgumentNullException("str");
            }

            if (startIndex < 0 || startIndex > str.Length)
            {
                throw new ArgumentOutOfRangeException("startIndex");
            }

            if (length < 0)
            {
                throw new ArgumentOutOfRangeException("length");
            }

            if (startIndex + length > str.Length)
            {
                throw new ArgumentOutOfRangeException("length");
            }

            if (length == 0)
            {
                return string.Empty;
            }

            var sb = new StringBuilder(length);

            // Exclusive end index (in chars) of the requested range
            int end = startIndex + length;

            var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

            while (enumerator.MoveNext())
            {
                // Stop once the requested range is covered; a grapheme that
                // starts inside the range is always included whole
                if (startIndex >= end)
                {
                    break;
                }

                string grapheme = enumerator.GetTextElement();
                startIndex += grapheme.Length;

                // Skip initial Low Surrogates/Combining Marks
                if (sb.Length == 0)
                {
                    if (char.IsLowSurrogate(grapheme[0]))
                    {
                        continue;
                    }

                    UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                    if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                    {
                        continue;
                    }
                }

                sb.Append(grapheme);
            }

            return sb.ToString();
        }
    }
This will return what you asked for.
It looks like you want to split a string on graphemes, that is, on single displayed characters.
In that case, there is a handy method: StringInfo.SubstringByTextElements.
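A minimal sketch of how that could be used for a truncation measured in graphemes. The TextElementTruncate class and the TakeTextElements helper are hypothetical names; only StringInfo, LengthInTextElements and SubstringByTextElements come from the framework:

```csharp
using System;
using System.Globalization;

public static class TextElementTruncate
{
    // Returns the first `count` text elements (graphemes) of str,
    // or the whole string if it has fewer than that.
    public static string TakeTextElements(string str, int count)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));

        var info = new StringInfo(str);
        int n = Math.Min(count, info.LengthInTextElements);

        // SubstringByTextElements rejects an empty source string, so
        // handle the "nothing to take" case up front.
        if (n == 0) return string.Empty;

        return info.SubstringByTextElements(0, n);
    }
}
```

Note that the count here is in graphemes, not chars: taking 6 text elements from "Hello\uD83D\uDE00 world" gives a string that is 7 chars long ("Hello" plus the two-char emoji).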
Here is a simple implementation for truncate (startIndex = 0):
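    // If the cut would fall between the two halves of a surrogate pair, cut one char earlier.
    // Note: this only protects surrogate pairs, not combining sequences.
    string truncatedStr = (str.Length > maxLength)
        ? str.Substring(0, maxLength - (maxLength > 0 && char.IsLowSurrogate(str[maxLength]) ? 1 : 0))
        : str;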