char vs wchar_t vs char16_t vs char32_t (C++11)

耶瑟儿~ 2020-12-13 01:57

From what I understand, a char is safe to house ASCII characters whereas char16_t and char32_t are safe to house characters from unico

2 Answers
  •  孤城傲影
    2020-12-13 02:34

    char is for 8-bit code units, char16_t is for 16-bit code units, and char32_t is for 32-bit code units. Any of these can be used for 'Unicode'; UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
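
    As a rough illustration of the code-unit counts described above, the sketch below measures how many code units one code point needs in each encoding form. The helper name `code_units` is hypothetical (not part of the standard library), and the character is written as a universal character name so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
  return N - 1;
}

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8: four 8-bit code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: a surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one 32-bit code unit");
```

    The same code point therefore occupies four `char` elements, two `char16_t` elements, or one `char32_t` element, which is why none of these types is a "character" by itself.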


    wchar_t was designed around a different guarantee: any character supported in a locale can be converted from char to wchar_t, and whatever representation char uses (multiple bytes, shift codes, and so on), each character becomes a single, distinct wchar_t value. The intent was that wchar_t strings could then be manipulated with the same simple algorithms used for ASCII.

    For example, converting ASCII to uppercase looks like this:

    #include <locale>

    auto loc = std::locale("");  // the user's preferred locale

    char s[] = "hello";
    for (char &c : s) {
      c = std::toupper(c, loc);  // one char in, one char out
    }
    

    But this won't handle converting all characters in UTF-8 to uppercase, or all characters in some other encoding like Shift-JIS. People wanted to be able to internationalize this code like so:

    #include <locale>

    auto loc = std::locale("");  // the user's preferred locale

    wchar_t s[] = L"hello";
    for (wchar_t &c : s) {
      c = std::toupper(c, loc);  // one wchar_t in, one wchar_t out
    }
    

    So every wchar_t is a 'character', and if it has an uppercase version it can be converted directly. Unfortunately, this does not always work. Some languages have oddities such as the German letter ß, whose uppercase form is conventionally the two characters SS rather than a single character, so a one-to-one mapping cannot produce it.
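
    To make the ß problem concrete, here is a minimal sketch of uppercasing as a string-to-string operation instead of a per-element one. The function `to_upper_sketch` is a hypothetical toy: it only knows ASCII letters and the single special case U+00DF, whereas real code would rely on a full Unicode library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: uppercases ASCII letters and expands U+00DF (ß) to "SS".
// Every other code point is copied through unchanged.
std::u32string to_upper_sketch(const std::u32string& in) {
  std::u32string out;
  for (char32_t c : in) {
    if (c == U'\u00DF') {
      out += U"SS";  // one input element becomes two output elements
    } else if (c >= U'a' && c <= U'z') {
      out += static_cast<char32_t>(c - U'a' + U'A');
    } else {
      out += c;
    }
  }
  return out;
}
```

    The output can be longer than the input, which is exactly what an in-place loop of the form `c = toupper(c, loc)` cannot express, no matter how wide the element type is.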

    So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of wchar_t intended. As such, wchar_t and wide characters in general provide little value.

    The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to just convert at the API boundaries to whatever encoding is required.
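
    Converting at the boundary could look roughly like the sketch below, which turns a UTF-8 std::string into UTF-16 just before calling an API that expects 16-bit code units. The function `utf8_to_utf16` is a hypothetical hand-written helper with only minimal validation; production code would more likely use the platform's own conversion routine or a Unicode library.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Rejects bad lead bytes, bad continuation bytes and truncated sequences,
// but does NOT reject overlong forms or encoded surrogates.
std::u16string utf8_to_utf16(const std::string& in) {
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    char32_t cp = 0;
    std::size_t extra = 0;  // number of continuation bytes expected
    if (lead < 0x80) {
      cp = lead;
    } else if ((lead & 0xE0) == 0xC0) {
      cp = lead & 0x1F; extra = 1;
    } else if ((lead & 0xF0) == 0xE0) {
      cp = lead & 0x0F; extra = 2;
    } else if ((lead & 0xF8) == 0xF0) {
      cp = lead & 0x07; extra = 3;
    } else {
      throw std::invalid_argument("invalid UTF-8 lead byte");
    }
    if (extra > in.size() - i - 1) {
      throw std::invalid_argument("truncated UTF-8 sequence");
    }
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char b = static_cast<unsigned char>(in[i + k]);
      if ((b & 0xC0) != 0x80) {
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      }
      cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    if (cp > 0x10FFFF) {
      throw std::invalid_argument("code point out of range");
    }
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Code points above the BMP become a surrogate pair.
      const char32_t v = cp - 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
  }
  return out;
}
```

    With a helper like this, the rest of the program can keep its text in std::string as UTF-8 and call the conversion only where a 16-bit API requires it.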
