I have a list of character range restrictions that I need to check a string against, but the char
type in .NET is UTF-16 and therefore some characters become wa
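You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:
1. The character is in the Basic Multilingual Plane and is encoded as a single code unit.
2. The character is outside the BMP and is encoded as a surrogate pair: a high surrogate code unit followed by a low surrogate code unit.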
Therefore, assuming the string is valid, this returns an array of code points for a given string:
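// Requires: using System; using System.Collections.Generic;
public static int[] ToCodePoints(string str)
{
    if (str == null)
        throw new ArgumentNullException("str");

    // There is at most one code point per UTF-16 code unit.
    var codePoints = new List<int>(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        // Reads the code point that starts at index i (one or two chars).
        codePoints.Add(Char.ConvertToUtf32(str, i));
        // A high surrogate means two chars were consumed; skip the low surrogate.
        if (Char.IsHighSurrogate(str[i]))
            i += 1;
    }

    return codePoints.ToArray();
}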
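To make the second case concrete, here is a rough sketch of the arithmetic that turns a surrogate pair into one code point. The class and method names (SurrogateMath, Combine) are made up for illustration; in real code Char.ConvertToUtf32 already does this for you.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: combines a high/low surrogate pair by hand.
    public static int Combine(char high, char low)
    {
        if (!Char.IsHighSurrogate(high) || !Char.IsLowSurrogate(low))
            throw new ArgumentException("Not a valid surrogate pair.");

        // The high surrogate carries the top 10 bits, the low surrogate the
        // bottom 10 bits, and the result is offset by 0x10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

For the pair 0xD83C, 0xDF00 this gives 0x1F300, the same value Char.ConvertToUtf32 returns for those two chars.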
An example with a surrogate pair
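As a minimal sketch of how this could be applied to the range check from the question: the ranges below are hypothetical placeholders (the actual restrictions are not shown in the question), and CodePointRanges / IsAllowed are names invented for this example. It walks the string by code point, the same way as above, and tests each one against the inclusive ranges.

```csharp
using System;

public static class CodePointRanges
{
    // Hypothetical allowed ranges (inclusive), for illustration only.
    private static readonly (int Min, int Max)[] Allowed =
    {
        (0x0020, 0x007E),   // printable ASCII
        (0x1F300, 0x1F5FF), // a block outside the BMP
    };

    public static bool IsAllowed(string str)
    {
        if (str == null)
            throw new ArgumentNullException(nameof(str));

        for (int i = 0; i < str.Length; i++)
        {
            // Read one full code point; throws on an unpaired surrogate.
            int codePoint = Char.ConvertToUtf32(str, i);
            if (Char.IsHighSurrogate(str[i]))
                i += 1; // skip the low surrogate of the pair

            bool inRange = false;
            foreach (var range in Allowed)
            {
                if (codePoint >= range.Min && codePoint <= range.Max)
                {
                    inRange = true;
                    break;
                }
            }
            if (!inRange)
                return false;
        }
        return true;
    }
}
```

The string "a\U0001F300" has Length 3 but only two code points, U+0061 and U+1F300, and IsAllowed returns true for it with these ranges. "\u00E9" returns false because U+00E9 is in neither range.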