I have tried searching stackoverflow to find an answer to this but the questions and answers I've found are around 10 years old and I can't seem to find c
I will try to throw out a few ideas here:
Most C++ programs and programmers just assume that text is an almost opaque sequence of bytes. UTF-8 is probably responsible for that, and it is no surprise that many comments boil down to: don't worry about Unicode, just process UTF-8 encoded strings.
Files only contain bytes. At some point, if you try to internally process true Unicode code points, you will have to serialize them to bytes -> here again, UTF-8 wins the point.
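Just to make that serialization step concrete, here is a minimal sketch (illustrative only: no validation of surrogates or values above U+10FFFF) of how a single code point becomes UTF-8 bytes:

```cpp
#include <cstdint>
#include <string>

// Minimal sketch: serialize one Unicode code point to its UTF-8 byte sequence.
// Illustrative only -- no validation of surrogates or values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: rest of the BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: supplementary planes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x1F44D) (the thumbs-up emoji) produces the four bytes F0 9F 91 8D, which is exactly what ends up in a UTF-8 encoded file.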
As soon as you go outside the Basic Multilingual Plane (16-bit code points), things become more and more complex. Emoji are especially awful to process: an emoji can be followed by a variation selector (U+FE0E VARIATION SELECTOR-15 (VS15) for text style or U+FE0F VARIATION SELECTOR-16 (VS16) for emoji style) to alter its display, more or less like the old i, backspace, ^ overstrike that was used with 1970s ASCII when one wanted to print î. That's not all: the characters U+1F3FB to U+1F3FF are used to provide a skin color for 102 human emoji spread across six blocks: Dingbats, Emoticons, Miscellaneous Symbols, Miscellaneous Symbols and Pictographs, Supplemental Symbols and Pictographs, and Transport and Map Symbols. That simply means that up to 3 consecutive Unicode code points can represent one single glyph... so the idea that one character is one char32_t is still an approximation.
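A couple of literals are enough to see that in practice (just a sketch; the code point names come from the Unicode charts):

```cpp
#include <iostream>
#include <string>

int main() {
    // U+1F44D THUMBS UP SIGN + U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4:
    // one glyph on the screen, two code points in memory.
    std::u32string thumbs = U"\U0001F44D\U0001F3FD";
    std::cout << thumbs.size() << '\n';   // prints 2

    // U+2764 HEAVY BLACK HEART + U+FE0F VARIATION SELECTOR-16 (emoji style):
    // again one glyph, two code points.
    std::u32string heart = U"\u2764\uFE0F";
    std::cout << heart.size() << '\n';    // prints 2
}
```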
My conclusion is that Unicode is a complex thing, and really requires a dedicated library like ICU. You can try to use simple tools like the converters of the standard library when you only deal with the BMP (keeping in mind that std::wstring_convert and the codecvt_utf8 family are deprecated since C++17), but full support is far beyond that.
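As an illustration of what a dedicated library buys you, here is a sketch using ICU's BreakIterator to count user-perceived characters (grapheme clusters), something the standard library has no notion of; the exact build line depends on your ICU installation:

```cpp
// Sketch only: code units vs. code points vs. grapheme clusters with ICU.
// Typical build: g++ demo.cpp $(pkg-config --cflags --libs icu-uc icu-i18n)
#include <unicode/unistr.h>
#include <unicode/brkiter.h>
#include <iostream>
#include <memory>

int main() {
    // UTF-8 bytes of U+1F44D U+1F3FD (thumbs up + medium skin tone): one glyph.
    icu::UnicodeString s =
        icu::UnicodeString::fromUTF8("\xF0\x9F\x91\x8D\xF0\x9F\x8F\xBD");

    std::cout << "UTF-16 code units: " << s.length()      << '\n'; // 4
    std::cout << "code points:       " << s.countChar32() << '\n'; // 2

    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
    if (U_FAILURE(status)) return 1;

    it->setText(s);
    int32_t graphemes = 0;
    for (it->first(); it->next() != icu::BreakIterator::DONE; )
        ++graphemes;
    std::cout << "grapheme clusters: " << graphemes << '\n';       // 1
}
```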
BTW: even other languages like Python, which claim to have native Unicode support (which is IMHO far better than the current C++ one), often fail on some parts: for example, Python's len() counts code points, so a skin-toned emoji still reports a length of 2 even though it displays as a single glyph.
So support for Unicode has been poor for more than 10 years, and I do not really expect that things will get much better in the next 10 years...