I can't think of a way to remove the leading zeros. My goal was, in a for loop, to then create the UTF-8 and UTF-32 versions of each number.
For example,
As the Wikipedia UTF-8 page describes, each Unicode code point (0 through 0x10FFFF) is encoded in UTF-8 as one to four bytes.
Here is a simple example function, edited from one of my earlier posts. I've now removed the U suffixes from the integer constants too. (Their intent was to remind the human programmer that the constants are deliberately unsigned (negative code points are not considered at all), and that the function assumes an unsigned int code. The compiler does not care, and probably because of that the practice seems odd and confusing even to long-term members here, so I give up and will stop trying to include such reminders. :( )
#include <stddef.h>

static size_t code_to_utf8(unsigned char *const buffer, const unsigned int code)
{
    if (code <= 0x7F) {
        buffer[0] = code;
        return 1;
    }

    if (code <= 0x7FF) {
        buffer[0] = 0xC0 | (code >> 6);           /* 110xxxxx */
        buffer[1] = 0x80 | (code & 0x3F);         /* 10xxxxxx */
        return 2;
    }

    if (code <= 0xFFFF) {
        buffer[0] = 0xE0 | (code >> 12);          /* 1110xxxx */
        buffer[1] = 0x80 | ((code >> 6) & 0x3F);  /* 10xxxxxx */
        buffer[2] = 0x80 | (code & 0x3F);         /* 10xxxxxx */
        return 3;
    }

    if (code <= 0x10FFFF) {
        buffer[0] = 0xF0 | (code >> 18);          /* 11110xxx */
        buffer[1] = 0x80 | ((code >> 12) & 0x3F); /* 10xxxxxx */
        buffer[2] = 0x80 | ((code >> 6) & 0x3F);  /* 10xxxxxx */
        buffer[3] = 0x80 | (code & 0x3F);         /* 10xxxxxx */
        return 4;
    }

    return 0;
}
You supply it with an unsigned char array, four chars or larger, and the Unicode code point. The function returns how many chars it needed to encode the code point in UTF-8, and thus how many it assigned in the array. It returns 0 (not encoded) for codes above 0x10FFFF, but it does not otherwise check that the code point is valid. That is, it is a simple encoder: all it knows about Unicode is that the code points are from 0 to 0x10FFFF, inclusive. It knows nothing about surrogate pairs, for example.
Note that because the code point is explicitly an unsigned integer, negative arguments will be converted to unsigned according to C rules.
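For instance, a minimal sketch of calling it might look like the following. The code point U+20AC (the euro sign) is just an arbitrary choice of mine, and the function above is assumed to be in the same source file:

#include <stdio.h>

int main(void)
{
    unsigned char buffer[4];
    const size_t len = code_to_utf8(buffer, 0x20AC); /* U+20AC, euro sign */
    size_t i;

    /* Print each encoded byte in hexadecimal: 0xE2 0x82 0xAC */
    for (i = 0; i < len; i++)
        printf(" 0x%02X", (unsigned int)buffer[i]);
    printf("\n");

    return 0;
}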
You need to write a function that prints out the 8 least significant bits of an unsigned char (the C standard does allow larger char sizes, but UTF-8 only uses 8 bits per char). Then use the above function to convert a Unicode code point (0 to 0x10FFFF, inclusive) to its UTF-8 representation, and call your bit function for each unsigned char in the array, in increasing order, for as many chars as the conversion function returned for that code point.
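A rough sketch of that idea might look something like the following. The function name print_bits and the small range of example code points are my own arbitrary choices, and code_to_utf8() is assumed to be the function shown earlier in this post:

#include <stdio.h>

/* Print the 8 least significant bits of c, most significant bit first. */
static void print_bits(const unsigned char c)
{
    for (int bit = 7; bit >= 0; bit--)
        putchar(((c >> bit) & 1) ? '1' : '0');
}

int main(void)
{
    unsigned char buffer[4];
    unsigned int code;

    /* Dump the UTF-8 bit patterns of a few code points around the one/two-byte boundary. */
    for (code = 0x7E; code <= 0x82; code++) {
        const size_t len = code_to_utf8(buffer, code);
        size_t i;

        printf("U+%04X:", code);
        for (i = 0; i < len; i++) {
            putchar(' ');
            print_bits(buffer[i]);
        }
        putchar('\n');
    }

    return 0;
}

For a code point like U+0080, this prints the two bytes 11000010 10000000, which shows the 110xxxxx/10xxxxxx pattern from the comments in the encoder above.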