My platform is a Mac and C++11 (or above). I\'m a C++ beginner and working on a personal project which processes Chinese and English. UTF-8 is the preferred encoding for thi
Unicode is a vast and complex topic. I do not wish to wade too deep there, however a quick glossary is necessary:
This is the basic of Unicode. The distinction between Code Point and Grapheme Cluster can be mostly glossed over because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture in smileys, flags, etc... then you may have to pay attention to the distinction.
Then, a serie of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.
In UTF-X, X is the size in bits of the Code Unit, each Code Point is represented as one or several Code Units, depending on its magnitude:
std::string and std::wstring.std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string).std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.If you are only reading or composing strings, you should have no to little issues with std::string or std::wstring.
Troubles start when you start slicing and dicing, then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Clusters boundaries. The former can be handled easily enough on your own, the latter requires using a Unicode aware library.
std::string or std::u32string?If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint; though heavy use of Chinese may change the deal. As always, profile.
If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.
If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.
std::string.UTF-8 actually works quite well in std::string.
Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
Due the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:
str.find('\n') works,str.find("...") works for matching byte by byte1,str.find_first_of("\r\n") works if searching for ASCII characters.Similarly, regex should mostly works out of the box. As a sequence of characters ("haha") is just a sequence of bytes ("哈"), basic search patterns should work out of the box.
Be wary, however, of character classes (such as [:alphanum:]), as depending on the regex flavor and implementation it may or may not match Unicode characters.
Similarly, be wary of applying repeaters to non-ASCII "characters", "哈?" may only consider the last byte to be optional; use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
1 The key concepts to look-up are normalization and collation; this affects all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.