Are UTF-16 characters (as used by, for example, the wide WinAPI functions) always 2 bytes long?

半阙折子戏 2021-02-09 06:23

Please clarify for me, how does UTF16 work? I am a little confused, considering these points:

There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long, obviously)

8 answers

轮回少年 (original poster)

2021-02-09 06:54

You seem to have several misconceptions.

> There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long, obviously)

This is wrong. Assuming you mean the C++ type wchar_t, it is not always 2 bytes long: 4 bytes is also a common size, and nothing restricts it to those two values. If you mean WCHAR itself, that is not part of C++ but a platform-specific type.
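
To make the point about wchar_t concrete, here is a minimal sketch that asks the compiler instead of assuming a size. The helper name wchar_bits is made up for this illustration and is not part of any library.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t

// Hypothetical helper: width of wchar_t in bits for the current
// compiler and target. The C++ standard does not fix this value.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// char16_t (C++11) is the portable type for one UTF-16 code unit:
// it is guaranteed to be at least 16 bits wide on every platform.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16,
              "char16_t must be able to hold a UTF-16 code unit");
```

With MSVC on Windows, wchar_bits() is typically 16; with GCC or Clang on Linux and macOS it is typically 32. Code that needs UTF-16 code units regardless of platform can use char16_t and std::u16string instead of wchar_t.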

> There are no "extra wide" functions or characters types widely used in C++ or windows, so I would assume that UTF16 is all that is ever needed.
>
> UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.

UTF-8 and UTF-16 are different encodings of the same character set, so UTF-16 is not "bigger". Technically, the scheme used in UTF-8 could be extended to encode more code points than the scheme used in UTF-16, but as standardized, UTF-8 and UTF-16 encode exactly the same set (U+0000 through U+10FFFF, excluding the surrogate range).
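
As a rough illustration of "same set, different encodings", the sketch below encodes a single code point both ways. The function names to_utf8 and to_utf16 are invented for this example; a real program would more likely rely on a library or OS facility.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one code point as UTF-8 (1 to 4 bytes).
std::string to_utf8(char32_t cp) {
    // Valid code points are U+0000..U+10FFFF, minus the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as UTF-16
// (1 or 2 code units of 16 bits each).
std::u16string to_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);              // one code unit
    } else {
        cp -= 0x10000;                                 // 20 bits remain
        out += static_cast<char16_t>(0xD800 | (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

For U+0041 (A) both encodings use a single code unit. For U+20AC (the euro sign) UTF-8 needs 3 bytes and UTF-16 one 2-byte unit. For U+1F600 both need 4 bytes: four UTF-8 code units, or the UTF-16 surrogate pair 0xD83D 0xDE00. So neither encoding is fixed-width.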

Don't use the term "character" lightly when it comes to Unicode. A code unit in UTF-16 is 2 bytes wide, and a code point is represented by 1 or 2 code units. What humans usually understand as a "character" is something else again and can be composed of one or more code points. If you as a programmer confuse code points with characters, bad things can happen, like http://ideone.com/qV2il
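
The three levels (code unit, code point, character) can be seen in a small sketch. The function count_code_points is a made-up name for this illustration, and it assumes the simple rule that a high surrogate followed by a low surrogate forms one code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-16 string.
// A high surrogate (0xD800..0xDBFF) immediately followed by a low
// surrogate (0xDC00..0xDFFF) is one code point; every other code
// unit, including an unpaired surrogate, is counted on its own.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool next_low = i + 1 < s.size() &&
                              s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && next_low) {
            ++i;  // skip the second half of the surrogate pair
        }
        ++n;
    }
    return n;
}
```

For u"\U0001F600", size() reports 2 code units while count_code_points reports 1 code point. For u"e\u0301" (the letter e followed by a combining acute accent) both report 2, even though a reader sees a single character. Counting what users perceive as characters requires grapheme cluster segmentation, which this sketch does not attempt.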
