How would you get an array of Unicode code points from a .NET String?

前端未结

关注

 5  1399

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16 and therefore some characters become wa

相关标签:

5条回答

北海茫月

2020-12-09 04:14

This answer is not correct. See @Virtlink's answer for the correct one.

static int[] ExtractScalars(string s)
{
  if (!s.IsNormalized())
  {
    s = s.Normalize();
  }

  List<int> chars = new List<int>((s.Length * 3) / 2);

  var ee = StringInfo.GetTextElementEnumerator(s);

  while (ee.MoveNext())
  {
    string e = ee.GetTextElement();
    chars.Add(char.ConvertToUtf32(e, 0));
  }

  return chars.ToArray();
}

Notes: Normalization is required to deal with composite characters.

0 讨论(0)

春和景丽

2020-12-09 04:16
You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:
1. The character is from the Basic Multilingual Plane, and is encoded by a single code unit.
2. The character is outside the BMP, and encoded using a surrogare high-low pair of code units
Therefore, assuming the string is valid, this returns an array of code points for a given string:
```
public static int[] ToCodePoints(string str)
{
    if (str == null)
        throw new ArgumentNullException("str");

    var codePoints = new List<int>(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        codePoints.Add(Char.ConvertToUtf32(str, i));
        if (Char.IsHighSurrogate(str[i]))
            i += 1;
    }

    return codePoints.ToArray();
}
```
An example with a surrogate pair
0 讨论(0) 发布评论: 提交评论加载中...


          	          
            
           
            
                              
                
              
              
                
                  梦谈多话        
                
              
                            
                2020-12-09 04:17
              
            
            
                                                                       
I came up with the same approach suggested by Nicholas (and Jeppe), just shorter:

    public static IEnumerable<int> GetCodePoints(this string s) {
        var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true);
        var bytes = utf32.GetBytes(s);
        return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4));
    }


The enumeration was all I needed, but getting an array is trivial:

int[] codePoints = myString.GetCodePoints().ToArray();

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  南笙        
                
              
                            
                2020-12-09 04:18
              
            
            
                                                                       
Doesn't seem like it should be much  more complicated than this:

public static IEnumerable<int> Utf32CodePoints( this IEnumerable<char> s )
{
  bool      useBigEndian = !BitConverter.IsLittleEndian;
  Encoding  utf32        = new UTF32Encoding( useBigEndian , false , true ) ;
  byte[]    octets       = utf32.GetBytes( s ) ;

  for ( int i = 0 ; i < octets.Length ; i+=4 )
  {
    int codePoint = BitConverter.ToInt32(octets,i);
    yield return codePoint;
  }

}

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  悲&欢浪女        
                
              
                            
                2020-12-09 04:29
              
            
            
                                                                       
This solution produces the same results as the solution by Daniel A.A. Pelsmaeker but is a little bit shorter:

public static int[] ToCodePoints(string s)
{
    byte[] utf32bytes = Encoding.UTF32.GetBytes(s);
    int[] codepoints = new int[utf32bytes.Length / 4];
    Buffer.BlockCopy(utf32bytes, 0, codepoints, 0, utf32bytes.Length);
    return codepoints;
}

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...