How do I use 3 and 4-byte Unicode characters with standard C++ strings?

猫巷女王i 2020-12-24 09:01

In standard C++ we have char and wchar_t for storing characters. char can store values between 0x00 and 0xFF. And

5 answers

余生分开走 (original poster)

2020-12-24 09:41
"In standard C++ we have char and wchar_t for storing characters. char can store values between 0x00 and 0xFF. And wchar_t can store values between 0x0000 and 0xFFFF."

Not quite:
```
sizeof(char)     == 1   Always: one byte per char object (a byte has at least 8 bits).
sizeof(wchar_t)  == ?   Implementation-defined
                        (usually 4 on Unix-like systems, usually 2 on Windows).
```
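The size rules above can be checked on your own platform. The following is a minimal sketch; the helper names `wchar_bits` and `wchar_holds_any_code_point` are made up for illustration and are not part of any library.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Guaranteed by the standard on every platform.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// char16_t and char32_t (C++11) are at least 16 and 32 bits wide.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds a UTF-16 code unit");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds any code point");

// Hypothetical helper: number of bits in one wchar_t on this platform.
std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true if a single wchar_t is wide enough for any
// 21-bit Unicode code point (typically true on Unix-like systems,
// false on Windows, where wchar_t is a 16-bit UTF-16 code unit).
bool wchar_holds_any_code_point() {
    return wchar_bits() >= 21;
}
```

If you need a type with a known width, `char16_t` and `char32_t` are a safer choice than `wchar_t`, whose size varies between platforms.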
"Unicode characters consume up to 4-byte space."

Not quite. Unicode is not an encoding. Unicode is a standard that defines what each code point is, and code points are restricted to 21 bits (U+0000 through U+10FFFF). The low 16 bits give the character's position within a code plane, while the upper 5 bits select which plane the character is on (only planes 0 through 16 are used).
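To make the plane/position split concrete, here is a small sketch. The function names `plane_of` and `offset_in_plane` and the constant `kMaxCodePoint` are hypothetical, chosen only for this example.

```cpp
#include <cassert>

// Largest valid Unicode code point.
constexpr char32_t kMaxCodePoint = 0x10FFFF;

// Plane number (0 through 16): the bits above the low 16.
unsigned plane_of(char32_t cp) {
    assert(cp <= kMaxCodePoint);
    return static_cast<unsigned>(cp >> 16);
}

// Position within the plane: the low 16 bits.
unsigned offset_in_plane(char32_t cp) {
    assert(cp <= kMaxCodePoint);
    return static_cast<unsigned>(cp & 0xFFFF);
}
```

For example, U+0041 ('A') is at position 0x0041 on plane 0 (the BMP), while U+1F600 is at position 0xF600 on plane 1.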

There are several Unicode encodings (UTF-8, UTF-16 and UTF-32 being the most common); an encoding is how you store the code points in memory. There are practical differences between the three:
```
    UTF-8:   Great for storage and transport (as it is compact)
             Bad because it is variable length
    UTF-16:  Horrible in nearly all regards
             It is always large and it is variable length
             (anything not on the BMP needs to be encoded as surrogate pairs)
    UTF-32:  Great for in memory representations as it is fixed size
             Bad because it takes 4 bytes for each character which is usually overkill
```
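To illustrate why UTF-8 and UTF-16 are variable length, the sketch below encodes a single code point both ways. The names `to_utf8` and `to_utf16` are hypothetical helpers written for this example, not standard library functions, and they assume the input is a valid code point (not a surrogate).

```cpp
#include <cassert>
#include <string>

// Encode one code point as UTF-8 (1 to 4 bytes).
std::string to_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 7 bits  -> 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 11 bits -> 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 16 bits -> 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 21 bits -> 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16 (1 or 2 code units).
std::u16string to_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {                    // on the BMP: one code unit
        out += static_cast<char16_t>(cp);
    } else {                               // off the BMP: surrogate pair
        char32_t v = cp - 0x10000;         // 20 bits remain
        out += static_cast<char16_t>(0xD800 | (v >> 10));   // high surrogate
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF)); // low surrogate
    }
    return out;
}
```

With these helpers, U+0041 takes 1 byte in UTF-8, U+20AC takes 3 bytes, and U+1F600 takes 4 bytes in UTF-8 and a surrogate pair (0xD83D, 0xDE00) in UTF-16. In UTF-32 every one of them is a single `char32_t`.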
Personally, I use UTF-8 for transport and storage and UTF-32 for the in-memory representation of text.
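One way to follow that approach is to decode UTF-8 into a `std::u32string` when text is loaded. The following is a minimal sketch, not a production decoder: `utf8_to_utf32` is a hypothetical name, and the function assumes the input is well-formed UTF-8 (it does not reject overlong forms or surrogates, and a truncated sequence is only caught by an `assert`).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode a UTF-8 byte string into UTF-32, one char32_t per code point.
// Assumes well-formed input; see the caveats in the text above.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len;   // total bytes in this sequence
        char32_t cp;       // payload bits from the lead byte
        if (lead < 0x80)      { len = 1; cp = lead; }
        else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; }
        else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; }
        else                  { len = 4; cp = lead & 0x07; }
        assert(i + len <= in.size());
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out += cp;
        i += len;
    }
    return out;
}
```

Once the text is in a `std::u32string`, indexing and `size()` work per code point, which is the benefit of the fixed-size in-memory form.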