How can I return the Unicode Code Point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of sur
A C# char cannot hold every Unicode code point: char is a 16-bit UTF-16 code unit, and code points above U+FFFF need two of them (a surrogate pair). The solution is to represent a code point either as a sequence of bytes (a byte array, or "flattened" into a 32-bit primitive) or as a string. The accepted answer converts to UTF-32, but that's not always ideal.
This is the code we use to split a string into its Unicode code point components while preserving the native UTF-16 encoding. The result is an enumerable of strings, one per code point, that can be compared against other (sub)strings natively in C#/.NET:
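using System;
using System.Collections.Generic;

public class InvalidEncodingException : Exception
{ }

// Extension methods must live in a static class; the name here is arbitrary.
public static class StringExtensions
{
    // Yields each code point of s as a string: one char for BMP characters,
    // two chars (a surrogate pair) for everything above U+FFFF.
    public static IEnumerable<string> UnicodeCodepoints(this string s)
    {
        for (int i = 0; i < s.Length; ++i)
        {
            if (Char.IsSurrogate(s[i]))
            {
                // A valid pair is a high surrogate followed by a low surrogate.
                if (s.Length < i + 2 || !Char.IsSurrogatePair(s[i], s[i + 1]))
                {
                    throw new InvalidEncodingException();
                }
                yield return string.Format("{0}{1}", s[i], s[++i]);
            }
            else
            {
                yield return string.Format("{0}", s[i]);
            }
        }
    }
}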
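
To get the "U+0041" form the question asks for, each code point still has to be turned into a number and formatted as hex. Here is a minimal, self-contained sketch of one way to do that. The class and method names (`CodePointFormatting`, `ToCodePointLabels`) are made up for illustration. It relies on the framework's `char.ConvertToUtf32(string, int)`, so an unpaired surrogate surfaces as an `ArgumentException` rather than the custom exception above:

```csharp
using System;
using System.Collections.Generic;

public static class CodePointFormatting
{
    // Hypothetical helper: yields one "U+XXXX" label per code point in s.
    public static IEnumerable<string> ToCodePointLabels(string s)
    {
        for (int i = 0; i < s.Length; ++i)
        {
            // Combines a surrogate pair starting at i into one code point;
            // throws ArgumentException if the surrogate at i is unpaired.
            int codePoint = char.ConvertToUtf32(s, i);

            // Skip the low surrogate that was consumed as part of the pair.
            if (char.IsHighSurrogate(s[i]))
            {
                ++i;
            }

            // "X4" pads to at least four hex digits, so 0x41 becomes "0041".
            yield return "U+" + codePoint.ToString("X4");
        }
    }
}
```

For example, `string.Join(" ", CodePointFormatting.ToCodePointLabels("A\uD83D\uDE00"))` gives `U+0041 U+1F600`. Unlike the method above, this flattens each code point into an int, so use it only for display, not for comparing substrings.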