On the Unicode site it's written that UTF-8 can be represented by 1-4 bytes. As I understand from this question https://softwareengineering.stackexchange.com/questions/7775
UTF-8 is an 8-bit variable-width encoding. The first 128 characters in Unicode, when represented with the UTF-8 encoding, have the same byte representation as the corresponding ASCII characters.
To understand this further, Unicode treats characters as code points: mere numbers that can be represented in multiple ways (the encodings). UTF-8 is one such encoding. It is the most commonly used, because for text dominated by ASCII characters it gives the best space consumption among the Unicode encodings. If you store characters from the ASCII character set with the UTF-8 encoding, the UTF-8 encoded data takes exactly the same amount of space as the plain ASCII data. This allowed applications that previously used ASCII to move to Unicode seamlessly (well, not quite, but it certainly didn't result in something like Y2K), because the character representations are the same.
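To see that compatibility in action, here is a minimal Python sketch (the sample string is just an illustrative value):

```python
# A string containing only characters from the ASCII range (U+0000 to U+007F).
text = "Hello, world!"

utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")

# For ASCII-only text the two encodings produce identical byte sequences,
# one byte per character, so no extra space is used.
print(utf8_bytes == ascii_bytes)   # True
print(len(text), len(utf8_bytes))  # 13 13
```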
Here is the relevant extract from RFC 3629, showing how the UTF-8 encoding works:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
You can see from this table why the encoding results in characters occupying anywhere from 1 to 4 bytes (the right-hand column), depending on which range of Unicode the character falls in (the left-hand column).
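To illustrate the table, here is a small Python sketch; the four sample characters are arbitrary picks, one from each range:

```python
# One sample character from each range in the RFC 3629 table.
for ch in ("A", "é", "€", "😀"):   # U+0041, U+00E9, U+20AC, U+1F600
    encoded = ch.encode("utf-8")
    # The number of bytes grows from 1 to 4 as the code point gets larger.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# U+0041  -> 1 byte(s): 41
# U+00E9  -> 2 byte(s): c3 a9
# U+20AC  -> 3 byte(s): e2 82 ac
# U+1F600 -> 4 byte(s): f0 9f 98 80
```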
UTF-16, UTF-32, UCS-2, etc. employ different encoding schemes, where code points are represented with 16-bit or 32-bit code units instead of the 8-bit code units that UTF-8 uses.
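As a rough comparison of the resulting sizes, a Python sketch (the test string is arbitrary; the BOM-free "-le" codec variants are used so the byte counts are not skewed by a byte-order mark):

```python
text = "héllo 😀"   # 7 code points: six from the BMP plus one emoji

for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(f"{codec}: {len(text.encode(codec))} bytes")

# utf-8:     11 bytes (1 or 2 bytes per BMP character, 4 for the emoji)
# utf-16-le: 16 bytes (2 bytes per BMP character, 4 for the emoji via a surrogate pair)
# utf-32-le: 28 bytes (a flat 4 bytes per code point)
```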