Convert Unicode code points to UTF-8 and UTF-32

自闭症患者 2020-12-29 17:09

I can't think of a way to remove the leading zeros. My goal was then to create, in a for loop, the UTF-8 and UTF-32 versions of each number.

For example,

3 Answers
  • 2020-12-29 17:18

    Converting to UTF-32 is trivial: the UTF-32 value is just the Unicode code point.

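    #include <stdio.h>   // fprintf
    #include <wchar.h>   // wint_t
    
    wint_t codepoint_to_utf32( const wint_t codepoint ) {
        if( codepoint > 0x10FFFF ) {
            // Cast so the argument type matches the %lx conversion.
            fprintf( stderr, "Codepoint %lx is out of UTF-32 range\n", (unsigned long)codepoint);
            // If wint_t is unsigned, -1 converts to its maximum value.
            return -1;
        }
    
        return codepoint;
    }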
    

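    Note that I'm using wint_t, w for "wide". That's an integer type which is guaranteed to be large enough to hold any wchar_t value as well as WEOF. wchar_t (wide character) is guaranteed to be wide enough for every character in the supported system locales, but its size is platform-dependent: on Windows it is 16 bits, which cannot hold code points above U+FFFF.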
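
A minimal sketch of the same check with a fixed-width type, assuming uint32_t from <stdint.h> in place of wint_t and a bool return in place of the -1 sentinel. The name codepoint_to_utf32_checked and the surrogate-range test are additions for illustration, not part of the code above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: stores the UTF-32 code unit in *out and returns
 * true, or returns false when the value cannot be encoded (above
 * U+10FFFF, or inside the surrogate range U+D800..U+DFFF). */
bool codepoint_to_utf32_checked(uint32_t codepoint, uint32_t *out) {
    if (codepoint > 0x10FFFF) {
        return false;
    }
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF) {
        return false;
    }
    /* UTF-32 stores the code point value unchanged. */
    *out = codepoint;
    return true;
}
```

Reporting failure separately from the value avoids relying on what -1 becomes when the return type is unsigned.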

    Converting to UTF-8 is a bit more complicated because its variable-length byte layout is designed to be compatible with 7-bit ASCII. Some bit shifting is required.

    Start with the UTF-8 table.

    U+0000  U+007F    0xxxxxxx
    U+0080  U+07FF    110xxxxx  10xxxxxx
    U+0800  U+FFFF    1110xxxx  10xxxxxx    10xxxxxx
    U+10000 U+10FFFF  11110xxx  10xxxxxx    10xxxxxx    10xxxxxx
    

    Turn that into a big if/else if statement.

    wint_t codepoint_to_utf8( const wint_t codepoint ) {
        wint_t utf8 = 0;
    
        // U+0000   U+007F    0xxxxxxx
        if( codepoint <= 0x007F ) {
        }
        // U+0080   U+07FF    110xxxxx  10xxxxxx
        else if( codepoint <= 0x07FF ) {
        }
        // U+0800   U+FFFF    1110xxxx  10xxxxxx    10xxxxxx
        else if( codepoint <= 0xFFFF ) {
        }
        // U+10000  U+10FFFF  11110xxx  10xxxxxx    10xxxxxx    10xxxxxx
        else if( codepoint <= 0x10FFFF ) {
        }
        else {
            fprintf( stderr, "Codepoint %x is out of UTF-8 range\n", codepoint);
            return -1;
        }
    
        return utf8;
    }
    

    And start filling in the blanks. The first one is easy, it's just the code point.

        // U+0000   U+007F    0xxxxxxx
        if( codepoint <= 0x007F ) {
            utf8 = codepoint;
        }
    

    To do the next one, we need to apply a bit mask and do some bit shifting. Standard C before C23 doesn't support binary literals (some compilers accept them as an extension), so I converted the binary into hex using perl -wle 'printf("%x\n", 0b1100000010000000)'

        // U+0080   U+07FF    110xxxxx  10xxxxxx
        else if( codepoint <= 0x00007FF ) {
            // Start at 1100000010000000
            utf8 = 0xC080;
    
            // 6 low bits using the bitmask 00111111
            // That fills in the 10xxxxxx part.
            utf8 += codepoint & 0x3f;
    
            // 5 high bits using the bitmask 11111000000
            // Shift over 2 to jump the hard coded 10 in the low byte.
            // That fills in the 110xxxxx part.
            utf8 += (codepoint & 0x7c0) << 2;
        }
    

    I'll leave the rest to you.
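
For reference, here is one possible way to fill in the remaining branches in the same packed-integer style as the 2-byte case above. This is a sketch under a few assumptions: it uses uint32_t rather than wint_t so that four bytes are guaranteed to fit, the name codepoint_to_utf8_packed is made up for this example, and 0xFFFFFFFF is an assumed error sentinel (no valid UTF-8 sequence packs to that value).

```c
#include <stdint.h>

/* Packs the UTF-8 bytes of a code point into one integer, with the first
 * byte in the most significant occupied position.
 * Returns 0xFFFFFFFF (assumed sentinel) for out-of-range input. */
uint32_t codepoint_to_utf8_packed(uint32_t codepoint) {
    if (codepoint <= 0x7F) {
        return codepoint;                       /* 0xxxxxxx */
    }
    if (codepoint <= 0x7FF) {
        return 0xC080u
             | ((codepoint & 0x7C0) << 2)       /* 110xxxxx */
             |  (codepoint & 0x03F);            /* 10xxxxxx */
    }
    if (codepoint <= 0xFFFF) {
        return 0xE08080u
             | ((codepoint & 0xF000) << 4)      /* 1110xxxx */
             | ((codepoint & 0x0FC0) << 2)      /* 10xxxxxx */
             |  (codepoint & 0x003F);           /* 10xxxxxx */
    }
    if (codepoint <= 0x10FFFF) {
        return 0xF0808080u
             | ((codepoint & 0x1C0000) << 6)    /* 11110xxx */
             | ((codepoint & 0x03F000) << 4)    /* 10xxxxxx */
             | ((codepoint & 0x000FC0) << 2)    /* 10xxxxxx */
             |  (codepoint & 0x00003F);         /* 10xxxxxx */
    }
    return 0xFFFFFFFFu;
}
```

Each group of six payload bits moves two positions further left than the group below it, because every continuation byte spends its top two bits on the fixed 10 marker.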

    We can test this with various interesting values that touch each piece of logic.

    int main() {    
        // https://codepoints.net/U+0041
        printf("LATIN CAPITAL LETTER A: %x\n", codepoint_to_utf8(0x0041));
        // https://codepoints.net/U+00A2
        printf("Cent sign: %x\n", codepoint_to_utf8(0x00A2));
        // https://codepoints.net/U+2603
        printf("Snowman: %x\n", codepoint_to_utf8(0x02603));
        // https://codepoints.net/U+10160
        printf("GREEK ACROPHONIC TROEZENIAN TEN: %x\n", codepoint_to_utf8(0x10160));
    
        printf("Out of range: %x\n", codepoint_to_utf8(0x00200000));
    }
    

    This is an interesting exercise, but if you want to do this for real, use a pre-existing library. GLib (from the GNOME project) has Unicode manipulation functions, along with many other pieces that standard C is missing.

  • 2020-12-29 17:29

    As the Wikipedia UTF-8 page describes, each Unicode code point (0 through 0x10FFFF) is encoded in UTF-8 as one to four bytes.

    Here is a simple example function, edited from one of my earlier posts. I've now removed the U suffixes from the integer constants too. (Their intent was to remind the human programmer that the constants are explicitly unsigned for a reason: negative code points are not considered at all, and the function does assume an unsigned int code. The compiler does not care, and probably because of that the practice seems odd and confusing even to long-term members here, so I give up and stop trying to include such reminders. :( )

    #include <stddef.h>  /* size_t */

    static size_t code_to_utf8(unsigned char *const buffer, const unsigned int code)
    {
        if (code <= 0x7F) {
            buffer[0] = code;
            return 1;
        }
        if (code <= 0x7FF) {
            buffer[0] = 0xC0 | (code >> 6);            /* 110xxxxx */
            buffer[1] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
            return 2;
        }
        if (code <= 0xFFFF) {
            buffer[0] = 0xE0 | (code >> 12);           /* 1110xxxx */
            buffer[1] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
            buffer[2] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
            return 3;
        }
        if (code <= 0x10FFFF) {
            buffer[0] = 0xF0 | (code >> 18);           /* 11110xxx */
            buffer[1] = 0x80 | ((code >> 12) & 0x3F);  /* 10xxxxxx */
            buffer[2] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
            buffer[3] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
            return 4;
        }
        return 0;
    }
    

    You supply it with an unsigned char array of four chars or more, and the Unicode code point. The function returns how many chars were needed to encode the code point in UTF-8, which is also how many it stored in the array. It returns 0 (not encoded) for codes above 0x10FFFF, but does not otherwise check that the Unicode code point is valid. I.e., it is a simple encoder, and all it knows about Unicode is that the code points are from 0 to 0x10FFFF, inclusive. It knows nothing about surrogate pairs, for example.

    Note that because the code point is explicitly an unsigned integer, negative arguments will be converted to unsigned according to C rules.

    You need to write a function that prints out the 8 least significant bits of each unsigned char (the C standard does allow larger char sizes, but UTF-8 only uses 8-bit chars). Then, use the above function to convert a Unicode code point (0 to 0x10FFFF, inclusive) to its UTF-8 representation, and call your bit function for each unsigned char in the array, in increasing order, for as many unsigned chars as the conversion function returned for that code point.
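
A minimal sketch of such a bit function. It formats into a caller-supplied string rather than printing directly, which is a design choice made here so the result is easy to check; the name bytes_to_bit_string and the space-separated layout are assumptions for this example.

```c
#include <stddef.h>

/* Writes the 8 least significant bits of each byte as '0'/'1' characters,
 * with a single space between bytes, followed by a terminating NUL.
 * `out` must have room for 9 * count characters (1 when count is 0).
 * Returns the number of characters written, excluding the NUL. */
size_t bytes_to_bit_string(const unsigned char *bytes, size_t count, char *out) {
    size_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        if (i > 0) {
            out[pos++] = ' ';
        }
        /* Walk from bit 7 (most significant) down to bit 0. */
        for (int bit = 7; bit >= 0; bit--) {
            out[pos++] = ((bytes[i] >> bit) & 1) ? '1' : '0';
        }
    }
    out[pos] = '\0';
    return pos;
}
```

Combined with the encoder above, calling code_to_utf8(buffer, 0xA2) and passing the buffer and the returned count to bytes_to_bit_string should yield "11000010 10100010", which can then be printed with puts.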

  • 2020-12-29 17:32

    There are many ways to do this fun exercise of converting a code point to UTF-8.

    So as not to give the whole coding experience away, the following pseudocode should get the OP started.

    #define UTF_WIDTH1_MAX       0x7F
    #define UTF_WIDTH2_MAX       0x7FF
    #define UTF_WIDTH3_MAX       0xFFFF
    #define UTF_WIDTH4_MAX       0x10FFFF
    
    void PrintCodepointUTF8(uint32_t codepoint) {
      uint8_t first;
      uint8_t continuation_bytes[3];
      unsigned continuation_bytes_n;
      if (codepoint <= UTF_WIDTH1_MAX) {
        first = codepoint;
        continuation_bytes_n = 0;
      } else if (codepoint <= UTF_WIDTH2_MAX) {
        // extract 5 bits for first and 6 bits for one continuation_byte
        // and set some bits
        first = ...;
        continuation_bytes = ...
        continuation_bytes_n = 1;
      } else if (codepoint <= UTF_WIDTH3_MAX) {
        if (isasurrogate(codepoint)) fail.
        // else extract 4 bits for first and 6 bits for each continuation_byte
        // and set some bits
        first = ...;
        continuation_bytes = ...
        continuation_bytes_n = 2;
      } else   if (codepoint <= UTF_WIDTH4_MAX) {
        // extract 3 bits for first and 6 bits for each continuation_byte
        // and set some bits
        first = ...;
        continuation_bytes = ...
        continuation_bytes_n = 3;
      } else {
        fail out of range.
      }
      print first and 0-3 continuation_bytes
    }
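
One way the outline above could be turned into compilable C, offered as a sketch rather than the answer's intended solution. The assumptions: the encoding is split into a separate EncodeCodepointUTF8 helper (a name invented here) that returns 0 on failure, the surrogate test is written inline as the range U+D800..U+DFFF instead of calling isasurrogate, and the bytes are printed as hex.

```c
#include <stdint.h>
#include <stdio.h>

#define UTF_WIDTH1_MAX 0x7F
#define UTF_WIDTH2_MAX 0x7FF
#define UTF_WIDTH3_MAX 0xFFFF
#define UTF_WIDTH4_MAX 0x10FFFF

/* Returns the number of bytes stored in out (1 to 4), or 0 on failure. */
unsigned EncodeCodepointUTF8(uint32_t codepoint, uint8_t out[4]) {
    unsigned continuation_bytes_n;
    uint8_t lead_marker;  /* fixed high bits of the first byte */

    if (codepoint <= UTF_WIDTH1_MAX) {
        continuation_bytes_n = 0; lead_marker = 0x00;
    } else if (codepoint <= UTF_WIDTH2_MAX) {
        continuation_bytes_n = 1; lead_marker = 0xC0;
    } else if (codepoint <= UTF_WIDTH3_MAX) {
        if (codepoint >= 0xD800 && codepoint <= 0xDFFF) {
            return 0;  /* surrogates are not encodable */
        }
        continuation_bytes_n = 2; lead_marker = 0xE0;
    } else if (codepoint <= UTF_WIDTH4_MAX) {
        continuation_bytes_n = 3; lead_marker = 0xF0;
    } else {
        return 0;  /* out of range */
    }

    /* Fill continuation bytes from the last one backwards, 6 bits each. */
    for (unsigned i = continuation_bytes_n; i > 0; i--) {
        out[i] = (uint8_t)(0x80 | (codepoint & 0x3F));
        codepoint >>= 6;
    }
    /* The bits that remain belong in the first byte. */
    out[0] = (uint8_t)(lead_marker | codepoint);
    return continuation_bytes_n + 1;
}

void PrintCodepointUTF8(uint32_t codepoint) {
    uint8_t bytes[4];
    unsigned n = EncodeCodepointUTF8(codepoint, bytes);
    if (n == 0) {
        fprintf(stderr, "Cannot encode %lX as UTF-8\n", (unsigned long)codepoint);
        return;
    }
    for (unsigned i = 0; i < n; i++) {
        printf("%02X%s", bytes[i], i + 1 < n ? " " : "\n");
    }
}
```

Peeling six bits at a time off the low end means the same loop serves all widths, and only the lead marker and the byte count differ per branch.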
    