What's the difference between an “encoding,” a “character set,” and a “code page”?

北恋 2020-12-12 20:14

I'm really trying to get better with this stuff. I'm pretty functional with internationalization concepts like this, but I need to get a better background on the theory b

5 answers
  •  情深已故
    2020-12-12 21:00

    A character set is a set of characters, loosely speaking "glyphs", i.e. visual symbols representing units of communication. (Strictly, a character is the abstract unit and a glyph is one way of drawing it.) The letter a is a character and so is € (euro sign). Character sets usually assign an integer (a codepoint) to each character, but it's the encoding that dictates the binary representation of the character.

    I'm a Ruby programmer, so here are some examples to help you understand the concepts.

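    This shows how Unicode maps codepoints to characters, but not how each character is stored as bytes. (The string literals are assumed to be UTF-8, which is the default source encoding from Ruby 2.0 on; in irb it normally follows your locale.)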

    >> 'a'.codepoints.to_a
    => [97]
    >> '€'.codepoints.to_a
    => [8364]
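
    The mapping also runs in the other direction. As a small sketch (the variable names are mine, and the string is written with a \u escape so it doesn't depend on the source file's encoding), this goes from character to codepoint, formats it in the usual U+ notation, and converts it back:

    ```ruby
    # "\u20AC" is the escape for the euro sign; it always yields a UTF-8 string.
    euro = "\u20AC"

    codepoint = euro.ord                        # character -> codepoint (Integer)
    label     = format('U+%04X', codepoint)     # conventional Unicode notation
    back      = codepoint.chr(Encoding::UTF_8)  # codepoint -> character

    puts codepoint     # 8364
    puts label         # U+20AC
    puts back == euro  # true
    ```

    None of this says anything about bytes yet; a codepoint is just a number assigned by the character set.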
    

    The following shows how the UTF-8 encoding stores each character as bytes (values 0 through 255, shown in base 10). (The string literals are assumed to be UTF-8 encoded, as they are by default in Ruby 2.0 and later.) Since 8364 (base 10) is too large to fit in one byte, UTF-8 has a specific strategy for breaking it into multiple bytes. Wikipedia's UTF-8 article shows the encoding algorithm, if you want to delve into the implementation.

    >> 'a'.bytes.to_a
    => [97]
    >> '€'.bytes.to_a
    => [226, 130, 172]
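
    To see where those three bytes come from, here is a rough hand-rolled version of the UTF-8 bit layout. It is only a sketch: utf8_bytes is a name I made up, it does no validation (surrogates and out-of-range values are not rejected), and in real code you would let Ruby do the encoding.

    ```ruby
    # Hypothetical helper: split a codepoint into UTF-8 bytes by hand.
    def utf8_bytes(cp)
      if cp < 0x80          # 7 bits:  0xxxxxxx
        [cp]
      elsif cp < 0x800      # 11 bits: 110xxxxx 10xxxxxx
        [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
      elsif cp < 0x10000    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
      else                  # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
         0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
      end
    end

    p utf8_bytes(97)    # [97]
    p utf8_bytes(8364)  # [226, 130, 172]

    # Cross-check against Ruby's own encoder.
    p utf8_bytes(8364) == [8364].pack('U').bytes  # true
    ```

    The leading bits of the first byte tell a decoder how many bytes the character occupies, and every continuation byte starts with the bits 10.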
    

    Here's the same thing in the ISO-8859-15 character set:

    >> 'a'.encode('iso-8859-15').codepoints.to_a
    => [97]
    >> '€'.encode('iso-8859-15').codepoints.to_a
    => [164]
    

    And the ISO-8859-15 encoding:

    >> 'a'.encode('iso-8859-15').bytes.to_a
    => [97]
    >> '€'.encode('iso-8859-15').bytes.to_a
    => [164]
    

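    Notice that the ISO-8859-15 codepoints match the byte representation. ISO-8859-15 is a single-byte encoding, so each codepoint is stored directly as one byte, whereas Unicode codepoints need a separate encoding step such as UTF-8.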
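
    As a further sketch (again with my own variable names), the same character can be encoded to different bytes, and the same byte can be decoded to different characters depending on which encoding you assume. Windows-1252 is included as an example of what Windows calls a code page.

    ```ruby
    euro = "\u20AC"  # euro sign, written as an escape so the source encoding doesn't matter

    # One character, several byte representations.
    %w[UTF-8 UTF-16BE ISO-8859-15 Windows-1252].each do |name|
      puts "#{name}: #{euro.encode(name).bytes.inspect}"
    end
    # UTF-8: [226, 130, 172]
    # UTF-16BE: [32, 172]
    # ISO-8859-15: [164]
    # Windows-1252: [128]

    # One byte, several interpretations.
    byte = [164].pack('C')  # a binary string holding the single byte 0xA4
    as_latin1 = byte.dup.force_encoding('ISO-8859-1').encode('UTF-8')
    as_latin9 = byte.dup.force_encoding('ISO-8859-15').encode('UTF-8')

    puts as_latin1.ord  # 164  (the currency sign in ISO-8859-1)
    puts as_latin9.ord  # 8364 (the euro sign in ISO-8859-15)
    ```

    This is why a byte stream is meaningless as text until you know which encoding produced it.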

    Here's a blog entry that might be helpful: http://blog.grayproductions.net/articles/what_is_a_character_encoding . Entries 1 through 3 are good if you don't want to get too Ruby-specific.
