Return code point of characters in C#

梦谈多话 2020-12-10 01:52

How can I return the Unicode code point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of sur

6 answers
  • 2020-12-10 02:10

    The following code writes the codepoints of a string input to the console:

    string input = "\uD834\uDD61";
    
    for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
    {
        var codepoint = char.ConvertToUtf32(input, i);
    
        Console.WriteLine("U+{0:X4}", codepoint);
    }
    

    Output:

    U+1D161

    Since strings in .NET are UTF-16 encoded, the char values that make up the string need to be converted to UTF-32 first.
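
    To get the exact strings asked for in the question, the same loop can be wrapped in a method. This is only a sketch: the class name CodePointNotation and the method name Format are made up for illustration and are not part of .NET.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class CodePointNotation
    {
        // Hypothetical helper: returns one "U+XXXX" string per code point in the input.
        public static List<string> Format(string input)
        {
            var result = new List<string>();
            // Advance by 2 chars over a surrogate pair, otherwise by 1.
            for (int i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
            {
                // char.ConvertToUtf32 throws ArgumentException on an unpaired surrogate.
                int codePoint = char.ConvertToUtf32(input, i);
                result.Add("U+" + codePoint.ToString("X4"));
            }
            return result;
        }
    }
    ```

    With this, CodePointNotation.Format("A") yields a single entry, "U+0041", and the surrogate pair from the example above yields a single "U+1D161".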

  • 2020-12-10 02:10

    A C# char cannot hold every Unicode code point: a char is a 16-bit UTF-16 code unit, and code points above U+FFFF need two of them (a surrogate pair). The solution is to represent a code point either as a sequence of bytes (as a byte array or "flattened" into a 32-bit primitive) or as a string. The accepted answer converts to UTF-32, but that's not always ideal.

    This is the code we use to split a string into its Unicode code point components while preserving the native UTF-16 encoding. The result is an enumerable that can be used to compare (sub)strings natively in C#/.NET:

        using System;
        using System.Collections.Generic;

        public class InvalidEncodingException : Exception
        { }

        public static class StringExtensions
        {
            // Yields each code point as a string: one char, or two for a surrogate pair.
            public static IEnumerable<string> UnicodeCodepoints(this string s)
            {
                for (int i = 0; i < s.Length; ++i)
                {
                    if (char.IsSurrogate(s[i]))
                    {
                        // A surrogate is only valid as a high surrogate followed by a low one.
                        if (!char.IsSurrogatePair(s, i))
                        {
                            throw new InvalidEncodingException();
                        }
                        yield return s.Substring(i, 2);
                        ++i; // skip the low surrogate
                    }
                    else
                    {
                        yield return s[i].ToString();
                    }
                }
            }
        }
    
  • 2020-12-10 02:20

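    I found a small method on an MSDN forum. Note that it takes a single char, so it cannot handle characters outside the Basic Multilingual Plane, which need two chars. Hope this helps.

        // Requires: using System; using System.Text;
        public int get_char_code(char character)
        {
            // The default UTF32Encoding is little-endian, which matches
            // BitConverter on little-endian platforms.
            UTF32Encoding encoding = new UTF32Encoding();
            byte[] bytes = encoding.GetBytes(character.ToString());
            return BitConverter.ToInt32(bytes, 0);
        }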
    
  • 2020-12-10 02:21
    public static string ToCodePointNotation(char c)
    {
    
        return "U+" + ((int)c).ToString("X4");
    }
    
    Console.WriteLine(ToCodePointNotation('a')); //U+0061
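
    The cast above works on a single char, so it only covers the Basic Multilingual Plane. For whole strings, a possible counterpart is sketched below, assuming .NET Core 3.0 or later, where System.Text.Rune and string.EnumerateRunes are available. The class name RuneNotation is made up for illustration.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;

    public static class RuneNotation
    {
        // Hypothetical string counterpart of ToCodePointNotation(char).
        // Assumes .NET Core 3.0+ (System.Text.Rune).
        public static IEnumerable<string> ToCodePointNotation(string s)
        {
            // EnumerateRunes walks the string one code point at a time and
            // substitutes U+FFFD for any unpaired surrogate.
            foreach (Rune rune in s.EnumerateRunes())
            {
                yield return "U+" + rune.Value.ToString("X4");
            }
        }
    }
    ```

    Each Rune represents one Unicode scalar value, so a surrogate pair produces a single entry rather than two.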
    
  • 2020-12-10 02:27

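    Actually, there is some merit in @Yogendra Singh's answer, currently the only one with a negative score. The job can be done like this:

        // Requires: using System; using System.Collections.Generic;
        // using System.Diagnostics; using System.Text;
        // Despite the name, the input is an ordinary (UTF-16) .NET string.
        // As an extension method, this must be declared in a static class.
        public static IEnumerable<int> Utf8ToCodePoints(this string s)
        {
            var utf32Bytes = Encoding.UTF32.GetBytes(s);
            const int bytesPerCharInUtf32 = 4;
            Debug.Assert(utf32Bytes.Length % bytesPerCharInUtf32 == 0);
            for (int i = 0; i < utf32Bytes.Length; i += bytesPerCharInUtf32)
            {
                // Encoding.UTF32 is little-endian, as is BitConverter on most platforms.
                yield return BitConverter.ToInt32(utf32Bytes, i);
            }
        }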
    

    Tested with

        var surrogatePairInput = "abc                                                                    
  • 2020-12-10 02:30

    Easy, as long as the character fits in a single char, since a char in C# is a UTF-16 code unit:

    char x = 'A';
    Console.WriteLine("U+{0:X4}", (int)x);
    

    To address the comments: a char in C# is a 16-bit number and holds a UTF-16 code unit. Code points above the 16-bit range cannot be represented in a single C# char, because chars are not variable width. A string, however, can contain two chars in a row, each a code unit, that together form a surrogate pair encoding one code point. If you have a string input with characters above the 16-bit range, you can use char.IsSurrogatePair and Char.ConvertToUtf32, as suggested in another answer:

    string input = "\uD834\uDD61"; // example input; any string works here
    for(int i = 0 ; i < input.Length ; i += Char.IsSurrogatePair(input,i) ? 2 : 1)
    {
        int x = Char.ConvertToUtf32(input, i);
        Console.WriteLine("U+{0:X4}", x);
    }
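
    For reference, the arithmetic that combines the two code units of a surrogate pair into one code point can be sketched by hand. This follows the UTF-16 definition; the class name SurrogateMath and the method name Combine are made up for illustration, and in practice Char.ConvertToUtf32 already does this.

    ```csharp
    using System;

    public static class SurrogateMath
    {
        // Hypothetical helper: combines a high and a low surrogate into a code point.
        public static int Combine(char high, char low)
        {
            if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            {
                throw new ArgumentException("Not a valid surrogate pair.");
            }
            // The high surrogate carries the upper 10 bits, the low surrogate
            // the lower 10 bits, and the result is offset by 0x10000.
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        }
    }
    ```

    For the pair D834 DD61 this gives 0x10000 + (0x34 << 10) + 0x161 = 0x1D161, the same value Char.ConvertToUtf32 returns.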
    