What is a multibyte character set?

时光说笑 2020-12-03 04:50

Does the term "multibyte" refer to a charset whose characters can, but don't have to, be wider than 1 byte (e.g. UTF-8), or does it refer to character sets which are in any

9 answers
  • 2020-12-03 05:08

    Typically the former, i.e. UTF-8-like. For more info, see Variable-width encoding.

  • 2020-12-03 05:09

    Any character set that doesn't have a 1 byte = 1 character mapping. All Unicode encodings, as well as the Asian character sets, are multibyte.

    For more information, I suggest reading this Wikipedia article.
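
To make the "1 byte = 1 character" distinction concrete, here is a minimal Python sketch that counts how many bytes each character occupies in a few encodings. The helper name `bytes_per_char` and the sample strings are illustrative assumptions, not part of the answer above. It encodes one character at a time, which is only meaningful for stateless encodings such as the ones listed.

```python
def bytes_per_char(text, encoding):
    """Return the byte length of each character of `text` in `encoding`.

    Hypothetical helper for illustration; it encodes each character
    separately, so it is not suitable for stateful encodings.
    """
    return [len(ch.encode(encoding)) for ch in text]


# Single-byte encoding: every character is exactly one byte.
print(bytes_per_char("Aé", "latin-1"))      # [1, 1]

# UTF-8: one to four bytes per character.
print(bytes_per_char("Aé€", "utf-8"))       # [1, 2, 3]

# Legacy East Asian encodings: one or two bytes per character here.
print(bytes_per_char("Aあ", "shift_jis"))   # [1, 2]
print(bytes_per_char("A한", "euc_kr"))      # [1, 2]

# UTF-16: two or four bytes; UTF-32: always four.
print(bytes_per_char("A😀", "utf-16-le"))   # [2, 4]
print(bytes_per_char("A😀", "utf-32-le"))   # [4, 4]
```

Note that UTF-32 never maps one byte to one character, yet every character has the same width, which is part of why the term is ambiguous.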

  • 2020-12-03 05:12

    The term is ambiguous, but in my internationalization work, we typically avoided using the term "multibyte character sets" for Unicode-based encodings. Generally, we used the term only for legacy encoding schemes that use one or more bytes to define each character (excluding encodings that require only one byte per character).

    Shift-JIS, JIS, EUC-JP, and EUC-KR, along with the Chinese encodings, are typically included.

    Most of the legacy encodings, with some exceptions, require a sort of state machine model (or, more simply, a page swapping model) to process, and moving backwards in a text stream is complicated and error-prone. UTF-8 and UTF-16 do not suffer from this problem, as UTF-8 can be tested with a bitmask and UTF-16 can be tested against a range of surrogate pairs, so moving backward and forward in a non-pathological document can be done safely without major complexity.

    A few legacy encodings, for languages like Thai and Vietnamese, have some of the complexity of multibyte character sets but are really just built on combining characters, and aren't generally lumped in with the broad term "multibyte."
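
The bitmask and surrogate-range tests mentioned in this answer can be sketched roughly as follows. This is a minimal Python illustration, not a reference implementation: the function names are hypothetical, and it assumes well-formed ("non-pathological") input and a `pos` of at least 1.

```python
import struct


def utf8_prev_boundary(data, pos):
    """Return the start index of the UTF-8 character ending just before `pos`.

    Assumes `data` is well-formed UTF-8 and `pos` sits on a character boundary.
    """
    pos -= 1
    # Continuation bytes have the bit pattern 10xxxxxx; step back over them.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


def utf16_prev_boundary(units, pos):
    """Return the start index of the UTF-16 character ending just before `pos`.

    `units` is a sequence of 16-bit code units, assumed well-formed.
    """
    pos -= 1
    # A low surrogate (DC00 to DFFF) preceded by a high surrogate
    # (D800 to DBFF) means the character spans two code units.
    if (pos > 0
            and 0xDC00 <= units[pos] <= 0xDFFF
            and 0xD800 <= units[pos - 1] <= 0xDBFF):
        pos -= 1
    return pos


def utf16_units(text):
    """Hypothetical helper: split `text` into a list of UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


data = "a€😀".encode("utf-8")                 # 1 + 3 + 4 bytes
print(utf8_prev_boundary(data, len(data)))    # 4, start of the emoji
print(utf8_prev_boundary(data, 4))            # 1, start of the euro sign

units = utf16_units("a😀")                    # [0x61, 0xD83D, 0xDE00]
print(utf16_prev_boundary(units, len(units))) # 1, start of the surrogate pair
```

Neither function needs to look at anything before the character it is stepping over. Many legacy multibyte encodings lack that property, because a byte's role can depend on the bytes that precede it.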
