I have a list of character range restrictions that I need to check a string against, but the char
type in .NET is UTF-16 and therefore some characters become wa
This answer is not correct. See @Virtlink's answer for the correct one.
static int[] ExtractScalars(string s)
{
if (!s.IsNormalized())
{
s = s.Normalize();
}
List<int> chars = new List<int>((s.Length * 3) / 2);
var ee = StringInfo.GetTextElementEnumerator(s);
while (ee.MoveNext())
{
string e = ee.GetTextElement();
chars.Add(char.ConvertToUtf32(e, 0));
}
return chars.ToArray();
}
Notes: Normalization is required to deal with composite characters.
You are asking about code points. In UTF-16 (C#'s char
) there are only two possibilities:
Therefore, assuming the string is valid, this returns an array of code points for a given string:
public static int[] ToCodePoints(string str)
{
if (str == null)
throw new ArgumentNullException("str");
var codePoints = new List<int>(str.Length);
for (int i = 0; i < str.Length; i++)
{
codePoints.Add(Char.ConvertToUtf32(str, i));
if (Char.IsHighSurrogate(str[i]))
i += 1;
}
return codePoints.ToArray();
}
An example with a surrogate pair
I came up with the same approach suggested by Nicholas (and Jeppe), just shorter:
public static IEnumerable<int> GetCodePoints(this string s) {
var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true);
var bytes = utf32.GetBytes(s);
return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4));
}
The enumeration was all I needed, but getting an array is trivial:
int[] codePoints = myString.GetCodePoints().ToArray();
Doesn't seem like it should be much more complicated than this:
public static IEnumerable<int> Utf32CodePoints( this IEnumerable<char> s )
{
bool useBigEndian = !BitConverter.IsLittleEndian;
Encoding utf32 = new UTF32Encoding( useBigEndian , false , true ) ;
byte[] octets = utf32.GetBytes( s ) ;
for ( int i = 0 ; i < octets.Length ; i+=4 )
{
int codePoint = BitConverter.ToInt32(octets,i);
yield return codePoint;
}
}
This solution produces the same results as the solution by Daniel A.A. Pelsmaeker but is a little bit shorter:
public static int[] ToCodePoints(string s)
{
byte[] utf32bytes = Encoding.UTF32.GetBytes(s);
int[] codepoints = new int[utf32bytes.Length / 4];
Buffer.BlockCopy(utf32bytes, 0, codepoints, 0, utf32bytes.Length);
return codepoints;
}