What is a multibyte character set?

前端未结


Does the term multibyte refer to a charset whose characters can - but don't have to be - wider than 1 byte, (e.g. UTF-8) or does it refer to character sets which are in any

9 answers

悲哀的现实

2020-12-03 04:53
A multibyte character is a character whose encoding requires more than 1 byte. This does not imply, however, that all characters in that encoding have the same width in bytes. For example, UTF-8 and UTF-16 are variable-width (1 to 4 bytes and 2 or 4 bytes per character respectively), whereas UTF-32 always uses 32 bits (4 bytes) per character.
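
The widths involved can be checked directly. The following is a minimal Python sketch, not part of the original answer; the sample characters and the helper name `byte_widths` are chosen purely for illustration.

```python
# Byte width of a few characters in three Unicode encodings.
# The "-le" codec variants are used so that no byte order mark is prepended.
SAMPLES = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, e-acute, a CJK character, an emoji
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]


def byte_widths(ch):
    """Return {encoding name: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", byte_widths(ch))
```

Only the UTF-32 column stays constant; the UTF-8 and UTF-16 columns grow as the code point gets larger.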

References:
- IBM: Multibyte Characters
- Unicode and MultiByte Character Set
- Unicode Consortium Website
刺人心

2020-12-03 04:56

The former, although the term "variable-length encoding" would be more appropriate.

后悔当初

2020-12-03 04:58

What is meant if anybody talks about multibyte character sets?

That, as usual, depends on who is doing the talking!

Logically, it should include UTF-8, Shift-JIS, GB etc.: the variable-length encodings. UTF-16 would often not be considered in this group (even though it kind of is, what with the surrogates; and certainly it's multiple bytes when encoded into bytes via UTF-16LE/UTF-16BE).

But in Microsoftland the term would more typically be used to mean a variable-length default system codepage (for legacy non-Unicode applications, of which there are sadly still plenty). In this usage, UTF-8 and UTF-16LE/UTF-16BE are not included because the system codepage on Windows historically could not be set to either of these encodings (recent Windows 10 releases added an opt-in UTF-8 system codepage, but the usage of the term predates that).

Indeed, in some cases “mbcs” is no more than a synonym for the system codepage, otherwise known (even more misleadingly) as “ANSI”. In this case a “multibyte” character set could actually be something as trivial as cp1252 Western European, which only uses one byte per character!
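
To make the cp1252 point concrete, here is a small Python sketch (an illustration added here, not from the original answer). The helper name `bytes_per_char` is hypothetical. Python also ships an `mbcs` codec that maps to the Windows ANSI codepage, but it exists only on Windows, so the sketch names the codepages explicitly instead.

```python
def bytes_per_char(text, encoding):
    """Return the encoded length in bytes of each character in text."""
    return [len(ch.encode(encoding)) for ch in text]


# cp1252 (Western European): every character is exactly one byte.
print(bytes_per_char("caf\u00e9", "cp1252"))

# cp932 (Microsoft's Shift-JIS variant): ASCII is one byte, kanji are two.
print(bytes_per_char("A\u65e5\u672c", "cp932"))
```

So a "multibyte" build that happens to run under cp1252 never actually produces a multi-byte character, while the same build under cp932 does.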

My advice: use “variable-length” when you mean that, and avoid the ambiguous term “multibyte”; when someone else uses it you'll need to ask for clarification, but typically someone with a Windows background will be talking about a legacy East Asian codepage like cp932 (Shift-JIS) and not a UTF.

时光说笑

2020-12-03 05:02

I generally use it to refer to any encoding in which a character can take more than one byte.

情歌与酒

2020-12-03 05:05

UTF-8 is multi-byte: each ASCII character (which covers unaccented English letters) is stored in 1 byte, while other characters take 2 to 4 bytes; most Chinese and Thai characters take 3. When you mix Thai with English, as in "ทt", the first character "ท" uses 3 bytes while the second character "t" uses only 1 byte. The designers of multi-byte encodings realized that storing an English character in 3 bytes when it fits in 1 would waste storage space.
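
The "ทt" example can be reproduced in a few lines of Python. This is only a sketch: the helper names `utf8_sequence_length` and `sequence_lengths` are made up for illustration, and the code assumes its input is already valid UTF-8.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence occupies, judged by its first byte."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead_byte <= 0xF4:
        return 4  # 11110xxx
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


def sequence_lengths(data):
    """Walk valid UTF-8 bytes and list the byte length of each character."""
    lengths = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        lengths.append(n)
        i += n
    return lengths


encoded = "\u0e17t".encode("utf-8")  # Thai "ท" followed by "t"
print(encoded)                       # b'\xe0\xb8\x97t'
print(sequence_lengths(encoded))     # [3, 1]
```

The first byte of each sequence is enough to know how many bytes follow, which is how a decoder tells the 3-byte Thai letter apart from the 1-byte "t".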

UTF-16 stores most characters, English or not, in a single 2-byte code unit, and in C such a unit is called a wide character rather than a multi-byte character. Strictly speaking UTF-16 is also variable-length: characters outside the Basic Multilingual Plane take two code units (4 bytes). It suits Chinese and Thai text, where almost every character fits in 2 bytes, but printing to a UTF-8 console requires converting from wide characters to the multi-byte format, for example with the function wcstombs().
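
One caveat on UTF-16 width is worth checking by hand: code points above U+FFFF do not fit in a single 16-bit unit and are written as a surrogate pair. The Python sketch below illustrates the arithmetic; the function name `utf16_code_units` is hypothetical, and the code does not reject inputs that fall inside the surrogate range itself.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as integers) for one Unicode code point."""
    if code_point < 0x10000:
        return [code_point]                # fits in a single 16-bit unit
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return [high, low]


print([hex(u) for u in utf16_code_units(0x0E17)])   # Thai "ท": one unit
print([hex(u) for u in utf16_code_units(0x1F600)])  # emoji: two units
```

The result agrees with Python's own codec: `"\U0001F600".encode("utf-16-be")` yields the four bytes `D8 3D DE 00`.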

UTF-32 stores each character in a fixed 4 bytes, but it is rarely used for storing or exchanging text because of the wasted space.

小鲜肉

2020-12-03 05:07

A multibyte character set may consist of both one-byte and two-byte characters. Thus a multibyte-character string may contain a mixture of single-byte and double-byte characters.

Ref: Single-Byte and Multibyte Character Sets
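
Code that processes such a mixed string has to look at each byte to decide whether it starts a double-byte character. A rough Python sketch of that walk, assuming codepage 932 (Shift-JIS) and its usual lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xFC; the function names are invented for illustration, and malformed input such as a trailing lead byte is not handled.

```python
def is_cp932_lead_byte(b):
    """True if byte value b starts a two-byte character in codepage 932."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_cp932(data):
    """Split cp932-encoded bytes into per-character byte strings."""
    chars = []
    i = 0
    while i < len(data):
        width = 2 if is_cp932_lead_byte(data[i]) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


data = "A\u65e5B\u672c".encode("cp932")  # single- and double-byte characters mixed
for piece in split_cp932(data):
    print(len(piece), piece.decode("cp932"))
```

Indexing such a string by byte offset alone can land in the middle of a character, which is why multibyte-aware string functions exist.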
