Return code point of characters in C#

孤街浪徒 提交于 2019-11-30 07:55:00

问题


How can I return the Unicode Code Point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of surrogate pairs.

With code point I mean the actual code point according to Unicode, which is different from code unit (UTF8 has 8-bit code units, UTF16 has 16-bit code units and UTF32 has 32-bit code units, in the latter case the value is equal to the code point, after taking endianness into account).


回答1:


Easy, since chars in C# is actually UTF16 code points:

char x = 'A';
Console.WriteLine("U+{0:x4}", (int)x);

To address the comments, A char in C# is a 16 bit number, and holds a UTF16 code point. Code points above 16 the bit space cannot be represented in a C# character. Characters in C# is not variable width. A string however can have 2 chars following each other, each being a code unit, forming a UTF16 code point. If you have a string input and characters above the 16 bit space, you can use char.IsSurrogatePair and Char.ConvertToUtf32, as suggested in another answer:

string input = ....
for(int i = 0 ; i < input.Length ; i += Char.IsSurrogatePair(input,i) ? 2 : 1)
{
    int x = Char.ConvertToUtf32(input, i);
    Console.WriteLine("U+{0:X4}", x);
}



回答2:


The following code writes the codepoints of a string input to the console:

string input = "\uD834\uDD61";

for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
{
    var codepoint = char.ConvertToUtf32(input, i);

    Console.WriteLine("U+{0:X4}", codepoint);
}

Output:

U+1D161

Since strings in .NET are UTF-16 encoded, the char values that make up the string need to be converted to UTF-32 first.




回答3:


C# cannot store unicode codepoints in a char, as char is only 2 bytes and unicode codepoints routinely exceed that length. The solution is to either represent a codepoint as a sequence of bytes (either as a byte array or "flattened" into a 32-bit primitive) or as a string. The accepted answer converts to UTF32, but that's not always ideal.

This is the code we use to split a string into its unicode codepoint components, but preserving the native UTF-16 encoding. The result is an enumerable that can be used to compare (sub)strings natively in C#/.NET:

    public class InvalidEncodingException : System.Exception
    { }

    public static IEnumerable<string> UnicodeCodepoints(this string s)
    {
        for (int i = 0; i < s.Length; ++i)
        {
            if (Char.IsSurrogate(s[i]))
            {
                if (s.Length < i + 2)
                {
                    throw new InvalidEncodingException();
                }
                yield return string.Format("{0}{1}", s[i], s[++i]);
            }
            else
            {
                yield return string.Format("{0}", s[i]);
            }
        }
    }
}



回答4:


Actually there is some merit in @Yogendra Singh 's answer, currently the only one with negative voting. The job can be done like this

    public static IEnumerable<int> Utf8ToCodePoints(this string s)
    {
        var utf32Bytes = Encoding.UTF32.GetBytes(s);
        var bytesPerCharInUtf32 = 4;
        Debug.Assert(utf32bytes.Length % bytesPerCharInUtf32 == 0);
        for (int i = 0; i < utf32bytes.Length; i+= bytesPerCharInUtf32)
        {
            yield return BitConverter.ToInt32(utf32bytes, i);
        }
    }

Tested with

    var surrogatePairInput = "abc💩";
    Debug.Assert(surrogatePairInput.Length == 5);
    var pointsAsString = string.Join(";" , 
        surrogatePairInput
        .Utf8ToCodePoints()
        .Select(p => $"U+{p:X4}"));
    Debug.Assert(pointsAsString == "U+0061;U+0062;U+0063;U+1F4A9");

Example is relevant because the pile of poo is represented as a surrogate pair.




回答5:


I found a little method on msdn forum. Hope this helps.

    public int get_char_code(char character){ 
        UTF32Encoding encoding = new UTF32Encoding(); 
        byte[] bytes = encoding.GetBytes(character.ToString().ToCharArray()); 
        return BitConverter.ToInt32(bytes, 0); 
    } 



回答6:


public static string ToCodePointNotation(char c)
{

    return "U+" + ((int)c).ToString("X4");
}

Console.WriteLine(ToCodePointNotation('a')); //U+0061


来源:https://stackoverflow.com/questions/13894021/return-code-point-of-characters-in-c-sharp

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!