How do I use 3 and 4-byte Unicode characters with standard C++ strings?

猫巷女王i 2020-12-24 09:01

In standard C++ we have char and wchar_t for storing characters. char can store values between 0x00 and 0xFF. And

5 answers
  •  余生分开走
    2020-12-24 09:41

    In standard C++ we have char and wchar_t for storing characters. char can store values between 0x00 and 0xFF. And wchar_t can store values between 0x0000 and 0xFFFF.

    Not quite:

    sizeof(char)     == 1   Always 1 byte, but plain char may be signed,
                            so its range is not necessarily 0x00 to 0xFF.
    sizeof(wchar_t)  == ?   Depends on your system
                            (usually 4 on Unix-like systems, usually 2 on Windows).
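
    The size claims above can be checked at compile time. This is a minimal sketch; the helper bits_in and the constant wchar_bits are names made up for illustration, not standard library facilities.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: number of bits in one object of type T on this platform.
template <typename T>
constexpr std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;
}

// Guaranteed on every conforming implementation.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(bits_in<char16_t>() >= 16, "char16_t can hold a UTF-16 code unit");
static_assert(bits_in<char32_t>() >= 32, "char32_t can hold any code point");

// Not guaranteed: the width of wchar_t varies by platform
// (commonly 32 bits on Linux and macOS, 16 bits on Windows).
constexpr std::size_t wchar_bits = bits_in<wchar_t>();
```

    Because wchar_t is the only type here whose width changes between platforms, char16_t and char32_t (available since C++11) are usually the safer choice when a specific code unit size matters.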
    

    Unicode characters consume up to 4-byte space.

    Not quite. Unicode is not an encoding. Unicode is a standard that defines what each code point is, and code points are restricted to 21 bits (U+0000 to U+10FFFF). The low 16 bits give the character's position within a code plane, while the upper 5 bits select which of the 17 planes the character is on.
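
    The split into plane and position is plain bit arithmetic. A small sketch, where plane_of, offset_in_plane, in_unicode_range and kMaxCodePoint are illustrative names rather than anything from the standard library:

```cpp
#include <cstdint>

// Largest Unicode code point, U+10FFFF (fits in 21 bits).
constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;

// Plane number: the bits above the low 16 (0 is the BMP, 16 is the last plane).
constexpr std::uint32_t plane_of(std::uint32_t cp) {
    return cp >> 16;
}

// Position within the plane: the low 16 bits.
constexpr std::uint32_t offset_in_plane(std::uint32_t cp) {
    return cp & 0xFFFF;
}

// Range check only; it does not exclude the surrogate block U+D800..U+DFFF.
constexpr bool in_unicode_range(std::uint32_t cp) {
    return cp <= kMaxCodePoint;
}
```

    For example, U+20AC (the euro sign) is at position 0x20AC on plane 0, while U+1F600 is at position 0xF600 on plane 1.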

    There are several Unicode encodings (UTF-8, UTF-16 and UTF-32 being the most common); an encoding is how you store the code points in memory. There are practical differences between the three.

        UTF-8:   Great for storage and transport (as it is compact)
                 Bad because it is variable length
        UTF-16:  Horrible in nearly all regards
                 It needs at least 2 bytes per character and is still variable length
                 (anything outside the BMP has to be encoded as a surrogate pair)
        UTF-32:  Great for in memory representations as it is fixed size
                 Bad because it takes 4 bytes for each character which is usually overkill
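
    To make the size differences concrete, here is a sketch that encodes a single code point as UTF-8 and as UTF-16. The function names encode_utf8 and encode_utf16 are made up for this example, and the code assumes the caller passes a valid code point that is not a surrogate.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one code point as UTF-8 (1 to 4 bytes).
std::string encode_utf8(std::uint32_t cp) {
    // Assumption: cp is in range and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: rest of the BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: planes 1 to 16
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16 (one unit on the BMP, else a surrogate pair).
std::u16string encode_utf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;                     // leaves a 20-bit value
        out += static_cast<char16_t>(0xD800 | (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

    With these helpers, U+0041 takes 1 byte in UTF-8 and 2 in UTF-16, U+20AC takes 3 and 2, and U+1F600 takes 4 in both. UTF-32 would use 4 bytes for each of them.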
    

    Personally, I use UTF-8 for transport and storage and UTF-32 for the in-memory representation of text.
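
    One way to follow that approach with standard strings is to keep UTF-8 bytes in a std::string at the boundaries and decode them into a std::u32string for processing. The following is a sketch of such a decoder, not a library API: utf8_to_utf32 is a hypothetical name, and throwing std::invalid_argument on malformed input is just one possible error policy.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string into UTF-32 code points.
// Throws std::invalid_argument on malformed input.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte gives the sequence length and the top payload bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values past U+10FFFF.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

    After decoding, every element of the std::u32string is one code point, so indexing and length no longer depend on how many bytes each character took. For instance, the 8-byte input "A\xE2\x82\xAC\xF0\x9F\x98\x80" decodes to 3 elements: U+0041, U+20AC and U+1F600.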
