Are UTF16 (as used by for example wide-winapi functions) characters always 2 byte long?

前端 未结 8 1325
半阙折子戏
半阙折子戏 2021-02-09 06:23

Please clarify for me, how does UTF16 work? I am a little confused, considering these points:

  • There is a static type in C++, WCHAR, which is 2 bytes long. (alway
8条回答
  •  轮回少年
    2021-02-09 06:54

    You seem to have several misconception.

    There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long obvisouly)

    This is wrong. Assuming you refer to the c++ type wchar_t - It is not always 2 bytes long, 4 bytes is also a common value, and there's no restriction that it can be only those two values. If you don't refer to that, it isn't in C++ but is some platform-specific type.

    • There are no "extra wide" functions or characters types widely used in C++ or windows, so I would assume that UTF16 is all that is ever needed.

    • UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.

    UTF-8 and UTF-16 are different encodings for the same character set, so UTF-16 is not "bigger". Technically, the scheme used in UTF-8 could encode more characters than the scheme used in UTF-16, but as UTF-8 and UTF-16 they encode the same set.

    Don't use the term "character" lightly when it comes to unicode. A codeunit in UTF-16 is 2 bytes wide, a codepoint is represented by 1 or 2 codeunits. What humans usually understand as "characters" is different and can be composed of one or more codepoints, and if you as a programmer confuse codepoints with characters bad things can happen like http://ideone.com/qV2il

提交回复
热议问题