What is the difference between UTF-8 and Unicode?

独厮守ぢ 2020-11-22 17:08

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page.

They are the same thing, aren't they? Can someone clarify?

15 answers
  •  臣服心动
    2020-11-22 17:25

    Let me use an example to illustrate this topic:

    A Chinese character:      汉
    its Unicode value:        U+6C49
    convert 6C49 to binary:   01101100 01001001
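
    The three lines above can be reproduced with a short Python sketch. The variable names are only illustrative; `ord` is the built-in that returns a character's code point.

```python
# Look up the Unicode code point of the example character
# and print it in the same hex and binary forms used above.
ch = "汉"
code_point = ord(ch)                  # integer code point assigned by Unicode

hex_form = f"U+{code_point:04X}"      # hex, at least four digits
binary_form = f"{code_point:016b}"    # binary, zero-padded to 16 bits

print(hex_form)
print(binary_form[:8], binary_form[8:])
```

    Note that this is only the abstract number Unicode assigns to the character; nothing here says how it is laid out in bytes.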
    

    Nothing magical so far, it's very simple. Now, let's say we decide to store this character on our hard drive. To do that, we need to store the character in binary format. We could simply store it as is: '01101100 01001001'. Done!

    But wait a minute, is '01101100 01001001' one character or two characters? You know this is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of "encoding" to tell the computer to treat it as one character.
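
    To make the ambiguity concrete, here is a small sketch that reads the same two bytes in two different ways. The codec names `ascii` and `utf-16-be` are Python's own; using a 16-bit big-endian reading to stand in for "treat it as one character" is my assumption for illustration, not something the answer above prescribes.

```python
# The two bytes 6C 49, stored "as is" with no encoding rules attached.
raw = bytes([0b01101100, 0b01001001])

# Interpretation 1: every byte is its own character.
as_two_chars = raw.decode("ascii")

# Interpretation 2: both bytes together form one 16-bit value.
as_one_char = raw.decode("utf-16-be")

print(repr(as_two_chars), len(as_two_chars))
print(repr(as_one_char), len(as_one_char))
```

    The bytes are identical in both cases; only the agreed-upon reading rule changes the result.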

    This is where the rules of 'UTF-8' come in: http://www.fileformat.info/info/unicode/utf8.htm

    Binary format of bytes in sequence
    
    1st Byte    2nd Byte    3rd Byte    4th Byte    Number of Free Bits   Maximum Expressible Unicode Value
    0xxxxxxx                                                7             007F hex (127)
    110xxxxx    10xxxxxx                                (5+6)=11          07FF hex (2047)
    1110xxxx    10xxxxxx    10xxxxxx                  (4+6+6)=16          FFFF hex (65535)
    11110xxx    10xxxxxx    10xxxxxx    10xxxxxx    (3+6+6+6)=21          10FFFF hex (1,114,111)
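
    The table boils down to a range check on the code point. A minimal sketch follows, with `utf8_length` being a hypothetical helper name of my own; the thresholds are the maximum values from the last column.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a code point, per the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:      # fits in 7 free bits
        return 1
    if code_point <= 0x7FF:     # fits in 11 free bits
        return 2
    if code_point <= 0xFFFF:    # fits in 16 free bits
        return 3
    return 4                    # up to 21 free bits, capped at 10FFFF


# Compare against Python's built-in encoder for a few sample characters.
for ch in ["A", "é", "汉", "😀"]:
    print(ch, utf8_length(ord(ch)), len(ch.encode("utf-8")))
```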
    

    According to the table above, if we want to store this character using the 'UTF-8' format, we need to prefix our character with some 'headers'. Our Chinese character is 16 bits long (count the binary digits yourself). Row 2 offers only 11 free bits, which is too few, so we will use the format on row 3, which provides exactly 16:

    Header  Place holder    Fill in our Binary   Result         
    1110    xxxx            0110                 11100110
    10      xxxxxx          110001               10110001
    10      xxxxxx          001001               10001001
    

    Writing out the result in one line:

    11100110 10110001 10001001
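
    The fill-in steps above can be sketched with bit shifts and masks. `encode_three_bytes` is a hypothetical helper written for this example and covers only row 3 of the table; in real code you would simply call `str.encode("utf-8")`.

```python
def encode_three_bytes(code_point: int) -> bytes:
    """Hand-encode a code point using the 1110xxxx 10xxxxxx 10xxxxxx layout."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")

    byte1 = 0b11100000 | (code_point >> 12)              # header 1110 + top 4 bits
    byte2 = 0b10000000 | ((code_point >> 6) & 0b111111)  # header 10 + middle 6 bits
    byte3 = 0b10000000 | (code_point & 0b111111)         # header 10 + low 6 bits
    return bytes([byte1, byte2, byte3])


encoded = encode_three_bytes(0x6C49)
bits = " ".join(f"{b:08b}" for b in encoded)
print(bits)
print(encoded == "汉".encode("utf-8"))
```

    The final line checks the hand-built bytes against Python's own UTF-8 encoder.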
    

    This is the UTF-8 (binary) value of the Chinese character! (confirm it yourself: http://www.fileformat.info/info/unicode/char/6c49/index.htm)

    Summary

    A Chinese character:      汉
    its Unicode value:        U+6C49
    convert 6C49 to binary:   01101100 01001001
    embed 6C49 as UTF-8:      11100110 10110001 10001001
    

    P.S. If you want to learn this topic in Python, click here
