Does the C++ standard mandate an encoding for wchar_t?

礼貌的吻别 2021-02-10 07:09

Here are some excerpts from my copy of the 2014 draft standard N4140

22.5 Standard code conversion facets [locale.stdcvt]

3 F

7 answers
  •  轮回少年
    2021-02-10 07:56

    The first interpretation is conditionally true.

    If the __STDC_ISO_10646__ macro (imported from C) is defined, then the character set of wchar_t is a superset of some version of Unicode.

    __STDC_ISO_10646__
    An integer literal of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month.
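    A minimal sketch of how one might query this macro at build time and spot-check the property it promises. The function names iso10646_version and euro_sign_matches_code_point are made up for illustration, and the single sample character is only a sanity check, not a proof of conformance.

```cpp
#include <string>

// Returns the yyyymm value of __STDC_ISO_10646__ as text,
// or an empty string when the implementation does not define the macro.
std::string iso10646_version() {
#ifdef __STDC_ISO_10646__
    return std::to_string(static_cast<long>(__STDC_ISO_10646__));
#else
    return std::string();
#endif
}

// Spot-checks the promised property for one sample character:
// U+20AC EURO SIGN stored in a wchar_t should have the value 0x20AC.
bool euro_sign_matches_code_point() {
    return static_cast<unsigned long>(L'\u20AC') == 0x20ACul;
}
```

    When iso10646_version() returns a non-empty string, euro_sign_matches_code_point() is expected to return true; when it returns an empty string, nothing is promised either way.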

    It appears that if the macro is defined, some kind of UCS4 can be assumed. (UCS2 is not enough for any recent version: ISO 10646 has assigned characters outside the Basic Multilingual Plane since 2001, and a 16-bit wchar_t cannot hold their values.)

    So if the macro is defined, then

    • there is a "native" wchar_t encoding
    • it is a superset of some version of UCS4
    • the conversion provided by codecvt_utf8 is compatible with this native encoding

    None of these things are required to hold if the macro is not defined.
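    For illustration, here is a rough sketch of the codecvt_utf8 conversion mentioned in the list above, wrapped in std::wstring_convert. The helper names utf8_to_wide and wide_to_utf8 are hypothetical. Note that <codecvt> and std::wstring_convert are deprecated since C++17, so a compiler may warn about them, and whether the resulting wchar_t values match the native wide encoding depends on the macro discussed above.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Decodes a UTF-8 byte string into wchar_t code units using codecvt_utf8.
// Throws std::range_error if the input is not valid UTF-8.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

// Encodes wchar_t code units back into a UTF-8 byte string.
std::string wide_to_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(wide);
}
```

    For example, utf8_to_wide("\xC3\xA9") should yield a single wchar_t with the value 0xE9 (U+00E9), and passing that result to wide_to_utf8 should give the original two bytes back.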

    There are also __STDC_UTF_16__ and __STDC_UTF_32__ but the C++ standard doesn't say what they mean. The C standard says that they signify UTF-16 and UTF-32 encodings for char16_t and char32_t respectively, but in C++ these encodings are always used.
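    The claim that C++ always uses UTF-16 and UTF-32 for char16_t and char32_t can be checked at compile time with a small sketch like the following. The names grin and utf_macros_defined are illustrative only; the sample character U+1F600 was picked because it lies outside the Basic Multilingual Plane and so needs a surrogate pair in UTF-16.

```cpp
// A char32_t literal holds the code point itself (UTF-32).
static_assert(U'\U0001F600' == 0x1F600, "char32_t literal is the code point");

// The same character as a char16_t string: two code units plus terminator.
constexpr char16_t grin[] = u"\U0001F600";
static_assert(sizeof(grin) / sizeof(grin[0]) == 3,
              "surrogate pair plus null terminator");
static_assert(grin[0] == 0xD83D && grin[1] == 0xDE00,
              "UTF-16 surrogate pair for U+1F600");

// Reports whether the implementation also defines the two C macros.
// The literal checks above do not depend on this.
constexpr bool utf_macros_defined() {
#if defined(__STDC_UTF_16__) && defined(__STDC_UTF_32__)
    return true;
#else
    return false;
#endif
}
```

    If the static_asserts compile, the literals are encoded as described regardless of what utf_macros_defined() returns.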

    Incidentally, the functions mbrtoc32 and c32rtomb convert back and forth between char sequences and char32_t sequences. In C they only use UTF-32 if __STDC_UTF_32__ is defined, but in C++ UTF-32 is always used for char32_t. So it would appear that even if __STDC_ISO_10646__ is not defined, it should be possible to convert between UTF-8 and wchar_t by going from UTF-8 to UTF-32-encoded char32_t, then to natively encoded char, then to natively encoded wchar_t. I would be wary of relying on such a convoluted chain, though.
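    A rough sketch of the last two hops of that chain (char32_t to natively encoded char, then to wchar_t) might look like this. The helper name char32_to_wchar is hypothetical. The sketch assumes <cuchar> is available on the platform, that the current C locale can represent the character, and that the character fits in a single wchar_t; none of these are guaranteed.

```cpp
#include <climits>  // MB_LEN_MAX
#include <cuchar>   // std::c32rtomb
#include <cwchar>   // std::mbrtowc, std::mbstate_t

// Converts one UTF-32 code point to wchar_t by way of the locale's
// multibyte char encoding. Returns false if either step fails.
bool char32_to_wchar(char32_t c32, wchar_t& out) {
    char buf[MB_LEN_MAX];

    // Step 1: char32_t -> natively encoded char sequence.
    std::mbstate_t to_mb{};
    std::size_t n = std::c32rtomb(buf, c32, &to_mb);
    if (n == static_cast<std::size_t>(-1)) {
        return false;  // not representable in the current locale
    }

    // Step 2: natively encoded char sequence -> wchar_t.
    std::mbstate_t to_wide{};
    std::size_t m = std::mbrtowc(&out, buf, n, &to_wide);
    if (m == static_cast<std::size_t>(-1) ||
        m == static_cast<std::size_t>(-2)) {
        return false;  // invalid or incomplete multibyte sequence
    }
    return true;
}
```

    In the default "C" locale this is only expected to succeed for basic characters such as U'A'; anything beyond that depends on the locale selected with std::setlocale, which is part of why the chain is fragile.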
