Convert Unicode code points to UTF-8 and UTF-32

自闭症患者 2020-12-29 17:09

I can't think of a way to remove the leading zeros. My goal was then to loop over the numbers and create the UTF-8 and UTF-32 versions of each one.

For example,

3 answers
  •  粉色の甜心
    2020-12-29 17:18

    Converting to UTF-32 is trivial: the UTF-32 code unit is just the Unicode code point.

    #include <stdio.h>
    #include <wchar.h>
    
    wint_t codepoint_to_utf32( const wint_t codepoint ) {
        if( codepoint > 0x10FFFF ) {
            fprintf( stderr, "Codepoint %x is out of UTF-32 range\n", codepoint);
            return -1;
        }
    
        return codepoint;
    }
    

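    Note that I'm using wint_t (w for "wide"), an integer type that can hold any wchar_t value as well as WEOF. wchar_t (wide character) is wide enough to represent every character in the supported locales. Neither type is guaranteed to be 32 bits wide, so on platforms with a 16-bit wchar_t a fixed-width type such as uint32_t is the safer choice.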
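
    One way to avoid depending on the platform's wint_t width is a fixed-width type. The following is only a sketch, not part of the original answer: the name codepoint_to_utf32_checked and the bool return convention are hypothetical. It also rejects the surrogate range U+D800 to U+DFFF, which Unicode reserves for UTF-16 pairs and which is not valid in UTF-32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: stores the UTF-32 code unit for `codepoint` in *out.
   Returns false for values that are not Unicode scalar values. */
bool codepoint_to_utf32_checked(uint32_t codepoint, uint32_t *out) {
    /* Beyond the last code point, U+10FFFF. */
    if (codepoint > 0x10FFFF) {
        return false;
    }
    /* Surrogates are reserved for UTF-16 pairs. */
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF) {
        return false;
    }
    *out = codepoint;
    return true;
}
```

    Reporting failure through the return value keeps the whole uint32_t range available for the result, so no code unit has to double as an error marker.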

    Converting to UTF-8 is a bit more complicated because it is a variable-length encoding designed to be backward compatible with 7-bit ASCII: a code point takes one to four bytes. Some bit shifting is required.

    Start with the UTF-8 table.

    U+0000  U+007F    0xxxxxxx
    U+0080  U+07FF    110xxxxx  10xxxxxx
    U+0800  U+FFFF    1110xxxx  10xxxxxx    10xxxxxx
    U+10000 U+10FFFF  11110xxx  10xxxxxx    10xxxxxx    10xxxxxx
    

    Turn that into a big if/else if statement.

    wint_t codepoint_to_utf8( const wint_t codepoint ) {
        wint_t utf8 = 0;
    
        // U+0000   U+007F    0xxxxxxx
        if( codepoint <= 0x007F ) {
        }
        // U+0080   U+07FF    110xxxxx  10xxxxxx
        else if( codepoint <= 0x07FF ) {
        }
        // U+0800   U+FFFF    1110xxxx  10xxxxxx    10xxxxxx
        else if( codepoint <= 0xFFFF ) {
        }
        // U+10000  U+10FFFF  11110xxx  10xxxxxx    10xxxxxx    10xxxxxx
        else if( codepoint <= 0x10FFFF ) {
        }
        else {
            fprintf( stderr, "Codepoint %x is out of UTF-8 range\n", codepoint);
            return -1;
        }
    
        return utf8;
    }
    

    And start filling in the blanks. The first one is easy, it's just the code point.

        // U+0000   U+007F    0xxxxxxx
        if( codepoint <= 0x007F ) {
            utf8 = codepoint;
        }
    

    To do the next one, we need to apply a bit mask and do some bit shifting. Standard C before C23 doesn't support binary literals (some compilers accept them as an extension), so I converted the binary into hex using perl -wle 'printf("%x\n", 0b1100000010000000)'

        // U+0080   U+07FF    110xxxxx  10xxxxxx
        else if( codepoint <= 0x07FF ) {
            // Start at 1100000010000000
            utf8 = 0xC080;
    
            // 6 low bits using the bitmask 00111111
            // That fills in the 10xxxxxx part.
            utf8 += codepoint & 0x3f;
    
            // 5 high bits using the bitmask 11111000000
            // Shift over 2 to jump the hard coded 10 in the low byte.
            // That fills in the 110xxxxx part.
            utf8 += (codepoint & 0x7c0) << 2;
        }
    

    I'll leave the rest to you.
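
    For reference, here is one way the remaining branches could be filled in, following the same packed-integer convention. This is a sketch rather than the original author's code: the name codepoint_to_utf8_packed, the use of uint32_t instead of wint_t, and returning UINT32_MAX for out-of-range input are all assumptions. It does not reject surrogates.

```c
#include <stdint.h>

/* Hypothetical completion: packs the UTF-8 bytes of `codepoint` into one
   32-bit integer, with the lead byte in the highest occupied position.
   Returns UINT32_MAX for values above U+10FFFF. */
uint32_t codepoint_to_utf8_packed(uint32_t codepoint) {
    if (codepoint <= 0x7F) {
        /* 0xxxxxxx */
        return codepoint;
    }
    if (codepoint <= 0x7FF) {
        /* 110xxxxx 10xxxxxx */
        return 0xC080u
             | ((codepoint & 0x7C0u) << 2)   /* 5 high bits */
             | (codepoint & 0x3Fu);          /* 6 low bits */
    }
    if (codepoint <= 0xFFFF) {
        /* 1110xxxx 10xxxxxx 10xxxxxx */
        return 0xE08080u
             | ((codepoint & 0xF000u) << 4)  /* 4 high bits */
             | ((codepoint & 0x0FC0u) << 2)  /* 6 middle bits */
             | (codepoint & 0x3Fu);          /* 6 low bits */
    }
    if (codepoint <= 0x10FFFF) {
        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        return 0xF0808080u
             | ((codepoint & 0x1C0000u) << 6)  /* 3 high bits */
             | ((codepoint & 0x03F000u) << 4)  /* next 6 bits */
             | ((codepoint & 0x000FC0u) << 2)  /* next 6 bits */
             | (codepoint & 0x3Fu);            /* 6 low bits */
    }
    return UINT32_MAX;
}
```

    Each shift grows by 2 per byte because every continuation byte spends its top two bits on the fixed 10 prefix. With this sketch, U+2603 packs to 0xE29883 and U+10160 packs to 0xF09085A0.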

    We can test this with various interesting values that touch each piece of logic.

    int main(void) {
        // https://codepoints.net/U+0041
        printf("LATIN CAPITAL LETTER A: %x\n", codepoint_to_utf8(0x0041));
        // https://codepoints.net/U+00A2
        printf("Cent sign: %x\n", codepoint_to_utf8(0x00A2));
        // https://codepoints.net/U+2603
        printf("Snowman: %x\n", codepoint_to_utf8(0x2603));
        // https://codepoints.net/U+10160
        printf("GREEK ACROPHONIC TROEZENIAN TEN: %x\n", codepoint_to_utf8(0x10160));

        // Above U+10FFFF, so this takes the error branch.
        printf("Out of range: %x\n", codepoint_to_utf8(0x00200000));

        return 0;
    }
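
    Packing the bytes into one integer is convenient for printing, but when the encoding has to go into a string or a file it is usually easier to emit the bytes one at a time. A sketch under that assumption follows; utf8_encode is a hypothetical name, and the caller is assumed to supply a buffer with room for 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 bytes of `codepoint` into `out`
   (which must have room for 4 bytes) and returns how many were written,
   or 0 if the code point is above U+10FFFF. */
size_t utf8_encode(uint32_t codepoint, unsigned char out[4]) {
    if (codepoint <= 0x7F) {
        out[0] = (unsigned char)codepoint;
        return 1;
    }
    if (codepoint <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (codepoint >> 6));
        out[1] = (unsigned char)(0x80 | (codepoint & 0x3F));
        return 2;
    }
    if (codepoint <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (codepoint >> 12));
        out[1] = (unsigned char)(0x80 | ((codepoint >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (codepoint & 0x3F));
        return 3;
    }
    if (codepoint <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (codepoint >> 18));
        out[1] = (unsigned char)(0x80 | ((codepoint >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((codepoint >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (codepoint & 0x3F));
        return 4;
    }
    return 0;
}
```

    For example, utf8_encode(0x2603, buf) returns 3 and fills buf with the bytes E2 98 83. Returning the length separately avoids any ambiguity about leading zero bytes in a packed value.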
    

    This is an interesting exercise, but for real work, use an existing library. GLib (from the GNOME project) has Unicode manipulation functions, along with many other pieces that standard C is missing.
