How would you get an array of Unicode code points from a .NET String?

日久生厌 2020-12-09 03:47

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16 and therefore some characters become wa

5 answers
  •  春和景丽
    2020-12-09 04:16

    You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:

    1. The character is from the Basic Multilingual Plane, and is encoded by a single code unit.
    2. The character is outside the BMP, and is encoded by a surrogate pair: a high surrogate code unit followed by a low surrogate code unit.
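To make the second case concrete, here is a minimal sketch of the arithmetic that turns a surrogate pair into a code point. The class and method names (`SurrogateMath`, `Combine`) are made up for illustration; the built-in `Char.ConvertToUtf32(char, char)` performs the same conversion and is what you would normally call.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: combine a high/low surrogate pair into one code point.
    // High surrogates are U+D800..U+DBFF, low surrogates are U+DC00..U+DFFF.
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Expected a high surrogate followed by a low surrogate.");

        // Each surrogate carries 10 bits; the result is offset by 0x10000,
        // the first code point beyond the BMP.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

For U+1D11E (MUSICAL SYMBOL G CLEF), which UTF-16 encodes as `\uD834\uDD1E`, `Combine('\uD834', '\uDD1E')` yields `0x1D11E`.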

    Therefore, assuming the string is valid, this returns an array of code points for a given string:

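    // Requires: using System; using System.Collections.Generic;
    public static int[] ToCodePoints(string str)
    {
        if (str == null)
            throw new ArgumentNullException("str");
    
        // At most one code point per UTF-16 code unit, so str.Length is a safe capacity.
        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            // Throws ArgumentException if str[i] starts an invalid (unpaired) surrogate.
            codePoints.Add(Char.ConvertToUtf32(str, i));
            // A surrogate pair occupies two code units, so skip the low surrogate.
            if (Char.IsHighSurrogate(str[i]))
                i += 1;
        }
    
        return codePoints.ToArray();
    }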
    

    An example with a surrogate pair
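As one possible illustration (a sketch, not necessarily the example the answer originally showed), the code below runs the same conversion on a string containing U+1D11E, which lies outside the BMP. `CodePointExample` and `AllInRange` are hypothetical names; `AllInRange` is only there to show how the resulting code points could be checked against a character range, as the question describes.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointExample
{
    // Same approach as the method above, repeated so this sketch is self-contained.
    public static int[] ToCodePoints(string str)
    {
        if (str == null)
            throw new ArgumentNullException(nameof(str));

        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            codePoints.Add(char.ConvertToUtf32(str, i));
            if (char.IsHighSurrogate(str[i]))
                i++; // skip the low surrogate of the pair
        }
        return codePoints.ToArray();
    }

    // Hypothetical helper: true if every code point lies in [min, max] inclusive.
    public static bool AllInRange(string str, int min, int max)
    {
        foreach (int cp in ToCodePoints(str))
        {
            if (cp < min || cp > max)
                return false;
        }
        return true;
    }
}
```

With the input `"a\uD834\uDD1Eb"`, the string's `Length` is 4 (four UTF-16 code units), but `ToCodePoints` returns three values: `0x61`, `0x1D11E`, `0x62`. A range check such as `AllInRange(input, 0x0000, 0xFFFF)` then returns `false`, because U+1D11E is outside the BMP.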
