How would you get an array of Unicode code points from a .NET String?

日久生厌 2020-12-09 03:47

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16 and therefore some characters become wa

5 Answers
  • 2020-12-09 04:14

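    This answer is not correct: it returns only the first code point of each text element, so any combining marks that follow the base character are dropped, and normalization can change the code points themselves. See @Virtlink's answer for the correct one.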

    static int[] ExtractScalars(string s)
    {
      if (!s.IsNormalized())
      {
        s = s.Normalize();
      }
    
      List<int> chars = new List<int>((s.Length * 3) / 2);
    
      var ee = StringInfo.GetTextElementEnumerator(s);
    
      while (ee.MoveNext())
      {
        string e = ee.GetTextElement();
        chars.Add(char.ConvertToUtf32(e, 0));
      }
    
      return chars.ToArray();
    }
    

    Notes: Normalization is required to deal with composite characters.
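
    To make the problem with this approach concrete, here is a minimal sketch written as a C# 9+ top-level program. The helper name `FirstScalarPerTextElement` and the sample input are assumptions chosen for illustration; the input is `x` followed by U+0301 (combining acute accent), a pair assumed to have no precomposed form, so normalization leaves it as two code points.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Globalization;

    // Hypothetical helper that mirrors the loop in ExtractScalars above:
    // normalize, walk the text elements, keep the first scalar of each.
    static int[] FirstScalarPerTextElement(string s)
    {
        var result = new List<int>();
        var ee = StringInfo.GetTextElementEnumerator(s.Normalize());
        while (ee.MoveNext())
        {
            result.Add(char.ConvertToUtf32(ee.GetTextElement(), 0));
        }
        return result.ToArray();
    }

    // Illustrative input (an assumption): two code points, one text element.
    string input = "x\u0301";
    int[] scalars = FirstScalarPerTextElement(input);

    // The input has two chars, but only the base letter survives.
    Console.WriteLine(input.Length);
    Console.WriteLine(string.Join(" ", Array.ConvertAll(scalars, cp => "U+" + cp.ToString("X4"))));
    ```

    The combining mark never reaches the output because `ConvertToUtf32(e, 0)` reads only the scalar at index 0 of each text element.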

  • 2020-12-09 04:16

    You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:

    1. The character is from the Basic Multilingual Plane (BMP) and is encoded by a single code unit.
    2. The character is outside the BMP and is encoded by a surrogate pair: a high surrogate followed by a low surrogate, two code units in total.

    Therefore, assuming the string is valid, this returns an array of code points for a given string:

    public static int[] ToCodePoints(string str)
    {
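        if (str == null)
            throw new ArgumentNullException(nameof(str));
    
        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            // Reads one code point at i; throws ArgumentException on an unpaired surrogate.
            codePoints.Add(Char.ConvertToUtf32(str, i));
            // A surrogate pair occupies two chars, so skip the low surrogate.
            if (Char.IsHighSurrogate(str[i]))
                i += 1;
        }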
    
        return codePoints.ToArray();
    }
    

    An example with a surrogate pair
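
    The original example is not reproduced here, so the following is a stand-in sketch rather than the answerer's own sample. It is written as a C# 9+ top-level program, repeats the loop from `ToCodePoints` as a local function, and uses an illustrative input (an assumption) containing U+1F300, which lies outside the BMP. The `manual` line decodes the same pair by hand to show what `ConvertToUtf32` computes.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Same loop as ToCodePoints above, as a local function.
    static int[] ToCodePoints(string str)
    {
        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            codePoints.Add(char.ConvertToUtf32(str, i));
            if (char.IsHighSurrogate(str[i]))
                i += 1; // skip the low surrogate that was just consumed
        }
        return codePoints.ToArray();
    }

    // Illustrative input: 'a', U+1F300 (stored as the pair D83C DF00), 'b'.
    string sample = "a\U0001F300b";
    int[] result = ToCodePoints(sample);

    // Manual decoding of the surrogate pair at indices 1 and 2.
    int manual = 0x10000 + ((sample[1] - 0xD800) << 10) + (sample[2] - 0xDC00);

    Console.WriteLine(sample.Length);  // 4 UTF-16 code units
    Console.WriteLine(string.Join(" ", Array.ConvertAll(result, cp => cp.ToString("X"))));  // 61 1F300 62
    Console.WriteLine(manual.ToString("X"));  // 1F300
    ```

    Four `char` values go in and three code points come out, because the pair collapses into a single value.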

  • 2020-12-09 04:17

    I came up with the same approach suggested by Nicholas (and Jeppe), just shorter:

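    // Requires System.Linq and System.Text; must be declared in a static class.
    public static IEnumerable<int> GetCodePoints(this string s) {
        // Encode in the machine's native byte order so BitConverter.ToInt32 reads each value correctly;
        // the last argument makes invalid surrogates throw instead of being replaced.
        var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true);
        var bytes = utf32.GetBytes(s);
        return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4));
    }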
    

    The enumeration was all I needed, but getting an array is trivial:

    int[] codePoints = myString.GetCodePoints().ToArray();
    
  • 2020-12-09 04:18

    Doesn't seem like it should be much more complicated than this:

    public static IEnumerable<int> Utf32CodePoints( this IEnumerable<char> s )
    {
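      // Encode in the machine's native byte order so BitConverter.ToInt32 reads each code point correctly.
      bool      useBigEndian = !BitConverter.IsLittleEndian;
      Encoding  utf32        = new UTF32Encoding( useBigEndian , false , true ) ;
      // Encoding.GetBytes has no IEnumerable<char> overload, so materialize the characters first (needs System.Linq).
      byte[]    octets       = utf32.GetBytes( s.ToArray() ) ;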
    
      for ( int i = 0 ; i < octets.Length ; i+=4 )
      {
        int codePoint = BitConverter.ToInt32(octets,i);
        yield return codePoint;
      }
    
    }
    
  • 2020-12-09 04:29

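    For valid strings, this solution produces the same results as the solution by Daniel A.A. Pelsmaeker but is a little shorter. It relies on Encoding.UTF32 being little-endian, so the block copy is only correct on little-endian machines, and an unpaired surrogate comes back as U+FFFD instead of throwing: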

    public static int[] ToCodePoints(string s)
    {
        byte[] utf32bytes = Encoding.UTF32.GetBytes(s);
        int[] codepoints = new int[utf32bytes.Length / 4];
        Buffer.BlockCopy(utf32bytes, 0, codepoints, 0, utf32bytes.Length);
        return codepoints;
    }
    