Return code point of characters in C#

梦谈多话 2020-12-10 01:52

How can I return the Unicode code point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of sur

6 answers
  •  刺人心 (original poster)
    2020-12-10 02:10

    A C# char cannot hold every Unicode code point: a char is a single 16-bit UTF-16 code unit, and code points above U+FFFF need two of them (a surrogate pair). The solution is to represent a code point either as a sequence of bytes (as a byte array or "flattened" into a 32-bit primitive) or as a string. The accepted answer converts to UTF-32, but that's not always ideal.
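
    To make the size limit concrete, here is a minimal sketch of how a surrogate pair maps to a single code point. The names `SurrogateMath` and `Combine` are made up for illustration; the built-in `char.ConvertToUtf32(char, char)` performs the same calculation and is what you would normally call.

    ```csharp
    using System;

    public static class SurrogateMath
    {
        // Hypothetical helper: combines a high and a low surrogate into one code point.
        public static int Combine(char high, char low)
        {
            if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            {
                throw new ArgumentException("Expected a high surrogate followed by a low surrogate.");
            }
            // Each surrogate carries 10 bits of payload; the result is offset by 0x10000.
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        }
    }
    ```

    For example, U+1F600 is stored in a string as the two chars '\uD83D' and '\uDE00', and `SurrogateMath.Combine('\uD83D', '\uDE00')` gives 0x1F600 again.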

    This is the code we use to split a string into its Unicode code points while preserving the native UTF-16 encoding. The result is an enumerable of strings, one per code point, that can be compared against (sub)strings natively in C#/.NET:

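        using System.Collections.Generic;

        public class InvalidEncodingException : System.Exception
        { }

        // Extension methods must live in a static class; the name here is arbitrary.
        public static class StringExtensions
        {
            // Yields one string per code point: a single char, or a surrogate pair.
            public static IEnumerable<string> UnicodeCodepoints(this string s)
            {
                for (int i = 0; i < s.Length; ++i)
                {
                    if (char.IsSurrogate(s[i]))
                    {
                        // A surrogate is only valid as a high surrogate followed by a low one.
                        if (s.Length < i + 2 || !char.IsSurrogatePair(s[i], s[i + 1]))
                        {
                            throw new InvalidEncodingException();
                        }
                        yield return string.Format("{0}{1}", s[i], s[++i]);
                    }
                    else
                    {
                        yield return string.Format("{0}", s[i]);
                    }
                }
            }
        }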
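
    To get from these per-code-point strings to the "U+0041" form the question asks for, one option is to run each element through `char.ConvertToUtf32(string, int)` and format the result as hex. This is only a sketch: `CodePointFormatter` and `ToUPlusNotation` are hypothetical names, and the helper assumes every element holds exactly one code point.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class CodePointFormatter
    {
        // Hypothetical helper: expects each element to hold exactly one code point
        // (one char, or one surrogate pair), as produced by a splitter like the one above.
        public static IEnumerable<string> ToUPlusNotation(IEnumerable<string> codePoints)
        {
            foreach (string element in codePoints)
            {
                // Reads the code point starting at index 0; throws ArgumentException
                // if the element starts with a lone surrogate.
                int value = char.ConvertToUtf32(element, 0);
                // "X4" pads to at least four uppercase hex digits.
                yield return "U+" + value.ToString("X4");
            }
        }
    }
    ```

    Calling `CodePointFormatter.ToUPlusNotation(new[] { "A", "\uD83D\uDE00" })` yields "U+0041" followed by "U+1F600".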
    
