In standard C++ we have `char` and `wchar_t` for storing characters. A `char` can store values between 0x00 and 0xFF (assuming the usual 8-bit byte).
The size and meaning of `wchar_t` is implementation-defined. On Windows it's 16 bits, as you say; on Unix-like systems it's often 32 bits, but not always. For that matter, a compiler is permitted to do its own thing and pick a different size for `wchar_t` than what the system uses -- it just won't be ABI-compatible with the rest of the system.
C++11 provides `std::u32string`, which is for representing strings of Unicode code points. I believe that sufficiently recent Microsoft compilers include it. It's of somewhat limited use, though, since Microsoft's system functions expect 16-bit wide characters (a.k.a. UTF-16LE), not 32-bit Unicode code points (a.k.a. UTF-32 / UCS-4).
You mention UTF-8, though: UTF-8 encoded data can be stored in a regular `std::string`. Of course, since it's a variable-length encoding, you can't access Unicode code points by index; you can only access the bytes by index. But you'd normally write your code so it doesn't need to access code points by index anyway, even if using `u32string`. Unicode code points don't correspond 1:1 with printable characters ("graphemes") because of combining marks in Unicode, so many of the little tricks you play with strings when learning to program (reversing them, searching for substrings) don't work so easily with Unicode data, no matter what you store it in.
The character