Are UTF16 (as used by for example wide-winapi functions) characters always 2 bytes long?

半阙折子戏 2021-02-09 06:23

Please clarify for me, how does UTF16 work? I am a little confused, considering these points:

  • There is a static type in C++, WCHAR, which is 2 bytes long. (alway
8 answers
  •  轮回少年
    2021-02-09 07:07

    There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long obviously)

    Well, WCHAR is a Microsoft thing, not a C++ thing.
    But there is a wchar_t for wide characters, though it is not always 2 bytes. On Linux systems it is usually 4 bytes.

    Most of msdn and some other documentation seem to have the assumptions that the characters are always 2 bytes long. This can just be my imagination, I can't come up with any particular examples, but it just seems that way.

    Do they? I can believe it.

    There are no "extra wide" functions or characters types widely used in C++ or windows, so I would assume that UTF16 is all that is ever needed.

    C/C++ make no assumptions about character encoding, though the OS can. For example, Windows uses UTF-16 in its interfaces, while on a lot of Linux systems wchar_t holds UTF-32. But you need to read the documentation for each interface to know explicitly.

    To my uncertain knowledge, unicode has a lot more characters than 65535, so they obviously don't have enough space in 2 bytes.

    2 bytes is all you need for numbers 0 -> 65535

    But UCS (the character set that the UTF encodings are based on) has code points up to U+10FFFF, which need up to 21 bits. Thus code points above U+FFFF are encoded as two 16-bit code units in UTF-16 (these are referred to as surrogate pairs).
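To make the surrogate pair arithmetic concrete, here is a minimal sketch of encoding one code point as UTF-16. The function name `encode_utf16` is hypothetical (it is not a WinAPI or standard library call), and the sketch assumes the input is a valid code point that is not itself a surrogate.

```cpp
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp <= 0x10FFFF and cp is not in the surrogate range 0xD800..0xDFFF.
std::vector<char16_t> encode_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // Plane 0: the code point fits in a single 16-bit code unit.
        return { static_cast<char16_t>(cp) };
    }
    // Supplementary planes: subtract 0x10000, leaving a 20-bit value,
    // then split it into two 10-bit halves.
    const char32_t v = cp - 0x10000;
    return {
        static_cast<char16_t>(0xD800 + (v >> 10)),    // high (lead) surrogate
        static_cast<char16_t>(0xDC00 + (v & 0x3FF))   // low (trail) surrogate
    };
}
```

For example, `encode_utf16(0x1F600)` yields the two code units 0xD83D 0xDE00, while `encode_utf16(U'A')` yields the single code unit 0x0041.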

    UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.

    UTF-8, UTF-16 and UTF-32 all encode the same set of code points (which need up to 21 bits each). UTF-32 is the only one that has a fixed size. UTF-16 was supposed to be fixed size, but then lots of other characters turned up that needed encoding and plane 0 ran out of space. So 16 more planes were added, for 17 in total (hence the extra bits).
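As a small illustration that the three encodings carry the same code point at different sizes, the sketch below counts the code units needed for U+1F600, a code point outside plane 0. The function names are made up for this example; only standard C++11 string literals and `std::basic_string` are used.

```cpp
#include <cstddef>
#include <string>

// Code units needed for the single code point U+1F600 in each encoding.
// The function names are illustrative only.

// UTF-8: sizeof counts bytes including the terminating NUL, so subtract 1.
std::size_t utf8_units()  { return sizeof(u8"\U0001F600") - 1; }

// UTF-16: one surrogate pair, i.e. two 16-bit code units.
std::size_t utf16_units() { return std::u16string(u"\U0001F600").size(); }

// UTF-32: always exactly one 32-bit code unit per code point.
std::size_t utf32_units() { return std::u32string(U"\U0001F600").size(); }
```

Here the same code point takes four 8-bit units in UTF-8, two 16-bit units in UTF-16 and one 32-bit unit in UTF-32, so it is 4 bytes in every case.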

    So if a UTF16 character not always 2 bytes long, how long else could it be? 3 bytes? or only multiples of 2?

    It is either one 16-bit code unit (2 bytes) or two 16-bit code units (4 bytes), so it is always a multiple of 2 bytes and never 3.

    And then for example if there is a winapi function that wants to know the size of a wide string in characters, and the string contains 2 characters which are each 4 bytes long, how is the size of that string in characters calculated?

    You have to step along and calculate each character one at a time.
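Stepping along the string could look roughly like the sketch below. It uses `std::u16string` so that it behaves the same on every platform; on Windows, where WCHAR is 16 bits, the same loop would apply to a wide string. The name `count_code_points` is hypothetical, and the sketch assumes mostly well-formed input: an unpaired surrogate is simply counted as one code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-16 string by walking
// its 16-bit code units and treating each surrogate pair as one code point.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool is_high = (u >= 0xD800 && u <= 0xDBFF);
        // A high surrogate followed by a low surrogate forms one pair.
        if (is_high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the low surrogate that completes the pair
        }
        ++n;
    }
    return n;
}
```

For the string `u"\U0001F600\U0001F601"`, `size()` reports 4 code units (8 bytes), while `count_code_points` reports 2.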

    Is it 2 chars long or 4 chars long? (since it is 8 bytes long, and each WCHAR is 2 bytes)

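    It all depends on your system. Windows API functions generally count in WCHAR units, so there the string is 4 long; where wchar_t is 4 bytes it would be 2.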
