What is a multibyte character set?

时光说笑 2020-12-03 04:50

Does the term multibyte refer to a charset whose characters can be, but don't have to be, wider than 1 byte (e.g. UTF-8), or does it refer to character sets which are in any

9 answers
  • 2020-12-03 04:53

    A multibyte character is a character whose encoding requires more than 1 byte. This does not imply, however, that all characters in that encoding have the same width in bytes. For example, a UTF-8 character takes 1 to 4 bytes and a UTF-16 character takes 2 or 4 bytes, whereas every UTF-32 character takes exactly 32 bits.
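
    The widths described above can be checked with a short Python sketch. The sample characters and the helper name `byte_widths` are illustrative choices, not part of the original answer.

```python
# Byte width of individual characters in three Unicode encodings.
# Sample characters (illustrative): "A", "é", Thai "ท", and an emoji outside the BMP.
SAMPLE = "A\u00e9\u0e17\U0001F600"

# Explicit byte order ("-le") so that encode() does not prepend a BOM.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]


def byte_widths(text, encoding):
    """Return how many bytes each character of text occupies in encoding."""
    return [len(ch.encode(encoding)) for ch in text]


for enc in ENCODINGS:
    print(enc, byte_widths(SAMPLE, enc))
```

    Only UTF-32 reports the same width for every character; UTF-8 and UTF-16 both vary.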

    References:

    • IBM: Multibyte Characters
    • Unicode and MultiByte Character Set
    • Unicode Consortium Website
  • 2020-12-03 04:56

    The former - although the term "variable-length encoding" would be more appropriate.

  • 2020-12-03 04:58

    What is meant if anybody talks about multibyte character sets?

    That, as usual, depends on who is doing the talking!

    Logically, it should include UTF-8, Shift-JIS, GB etc.: the variable-length encodings. UTF-16 would often not be considered in this group (even though it kind of is, what with the surrogates; and certainly it's multiple bytes when encoded into bytes via UTF-16LE/UTF-16BE).

    But in Microsoftland the term would more typically be used to mean a variable-length default system codepage (for legacy non-Unicode applications, of which there are sadly still plenty). In this usage, UTF-8 and UTF-16LE/UTF-16BE are not included, because the system codepage on Windows has traditionally not been settable to any of these encodings (recent Windows 10 releases add an opt-in UTF-8 codepage, but UTF-16 still cannot be one).

    Indeed, in some cases “mbcs” is no more than a synonym for the system codepage, otherwise known (even more misleadingly) as “ANSI”. In this case a “multibyte” character set could actually be something as trivial as cp1252 Western European, which only uses one byte per character!

    My advice: use “variable-length” when you mean that, and avoid the ambiguous term “multibyte”; when someone else uses it you'll need to ask for clarification, but typically someone with a Windows background will be talking about a legacy East Asian codepage like cp932 (Shift-JIS) and not a UTF.
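
    To make the distinction concrete, here is a small Python sketch comparing two legacy Windows codepages through Python's `cp1252` and `cp932` codecs. The helper `max_bytes_per_char` and the sample strings are made up for illustration. Python also has a codec named `mbcs`, available on Windows only, that maps to the current ANSI codepage; it is not used here so that the sketch runs anywhere.

```python
def max_bytes_per_char(text, encoding):
    """Return the widest single character of text, in bytes, under encoding."""
    return max(len(ch.encode(encoding)) for ch in text)


western = "caf\u00e9"    # "café"
japanese = "A\u3042"     # Latin "A" followed by hiragana "あ"

# cp1252 is a single-byte codepage, even when it is the system "mbcs" codepage.
print(max_bytes_per_char(western, "cp1252"))

# cp932 (Shift-JIS) is variable-length: ASCII takes 1 byte, kana and kanji take 2.
print(max_bytes_per_char(japanese, "cp932"))
print(japanese.encode("cp932"))
```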

  • 2020-12-03 05:02

    I generally use it to refer to any character set that can use more than one byte per character.

  • 2020-12-03 05:05

    UTF-8 is a multibyte encoding: each ASCII character (which covers English text) is stored in 1 byte, while other characters take 2 to 4 bytes; Chinese and Thai characters typically take 3. When you mix Thai with English, as in "ทt", the Thai character "ท" uses 3 bytes while the English character "t" uses only 1. The designers of multibyte encodings recognized that storing an English character in 3 bytes when it fits in 1 would waste storage space.
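
    The "ทt" example can be reproduced in Python. The sketch below also recovers the per-character lengths from the raw bytes by relying on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx. The function name `utf8_char_lengths` is hypothetical, and the input is assumed to be valid UTF-8.

```python
def utf8_char_lengths(data):
    """Return the byte length of each character in valid UTF-8 data."""
    lengths = []
    for byte in data:
        if (byte & 0xC0) == 0x80:
            # Continuation byte (10xxxxxx): belongs to the current character.
            lengths[-1] += 1
        else:
            # ASCII byte or lead byte: starts a new character.
            lengths.append(1)
    return lengths


encoded = "\u0e17t".encode("utf-8")   # the string "ทt"
print(encoded)                        # raw bytes of the two characters
print(utf8_char_lengths(encoded))     # one entry per character
```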

    UTF-16 stores most characters, English or not, in a single 2-byte unit, so it is usually described as a wide-character encoding rather than a multibyte one. Strictly speaking it is not fixed-width, because characters outside the Basic Multilingual Plane need a 4-byte surrogate pair. It suits Chinese and Thai well, since most of their characters fit in 2 bytes, but printing to a UTF-8 console requires a conversion from wide characters to the multibyte format, for example with the function wcstombs().

    UTF-32 stores every character in a fixed 4 bytes, but it is rarely used for storage because of the wasted space.

  • 2020-12-03 05:07

    A multibyte character set may consist of both one-byte and two-byte characters. Thus a multibyte-character string may contain a mixture of single-byte and double-byte characters.
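
    As a rough illustration of such a mixed string, the Python sketch below splits cp932 (Shift-JIS) bytes into one group per character by feeding them to an incremental decoder one byte at a time. The function name `split_characters` and the sample string are assumptions made for this example.

```python
import codecs


def split_characters(data, encoding):
    """Group the bytes of data so that each group encodes exactly one character."""
    decoder = codecs.getincrementaldecoder(encoding)()
    groups = []
    pending = b""
    for i in range(len(data)):
        byte = data[i:i + 1]
        pending += byte
        # The decoder returns "" while it is still waiting for a trail byte.
        if decoder.decode(byte):
            groups.append(pending)
            pending = b""
    return groups


mixed = "A\u3042b".encode("cp932")   # "A", hiragana "あ", "b"
print(len(mixed))                     # byte count, larger than the character count
print(split_characters(mixed, "cp932"))
```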

    Ref: Single-Byte and Multibyte Character Sets
