Python get character code in different encoding?

情书的邮戳 2021-02-04 06:05

Given a character code as an integer in one encoding, how can you get the corresponding character code in, say, UTF-8, again as an integer?

3 Answers
  •  萌比男神i
    2021-02-04 06:55

    You can only map an "integer number" from one encoding to another if they are both single-byte encodings.

    Here's an example using "iso-8859-15" and "cp1252" (aka "ANSI"):

    >>> s = u'€'
    >>> s.encode('iso-8859-15')
    '\xa4'
    >>> s.encode('cp1252')
    '\x80'
    >>> ord(s.encode('cp1252'))
    128
    >>> ord(s.encode('iso-8859-15'))
    164
    

    Note that ord is used here to get the ordinal value of the encoded byte. Calling ord on the original unicode string instead gives its Unicode code point:

    >>> ord(s)
    8364
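
    The single-byte mapping above can be wrapped in a small helper. This is only a minimal sketch in Python 3 syntax (the session above is Python 2), and the function name recode_single_byte is a hypothetical name chosen for illustration. It decodes the byte with the source encoding and re-encodes the resulting character with the target encoding:

```python
def recode_single_byte(code, src, dst):
    """Map a one-byte character code from encoding `src` to encoding `dst`.

    Hypothetical helper for illustration; assumes `code` is in range(256).
    """
    # Interpret the integer as a single byte and decode it to a character.
    char = bytes([code]).decode(src)
    # Re-encode that character in the target encoding.
    encoded = char.encode(dst)
    if len(encoded) != 1:
        # The target needs several bytes, so no single integer code exists.
        raise ValueError("%r needs %d bytes in %s" % (char, len(encoded), dst))
    return encoded[0]


# The euro sign is 164 in iso-8859-15 and 128 in cp1252.
print(recode_single_byte(164, "iso-8859-15", "cp1252"))  # 128
```

    If the character does not exist in the target encoding, the encode step raises UnicodeEncodeError, which is probably the behaviour you want rather than a silent substitution.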
    

    The reverse operation to ord can be done using either chr (for byte values in the range 0 to 255, returning a byte string) or unichr (for code points in the range 0 to sys.maxunicode, returning a unicode string):

    >>> print chr(65)
    A
    >>> print unichr(8364)
    €
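
    For readers on Python 3, a rough equivalent of the reverse operation is sketched below. There, unichr no longer exists: chr accepts any code point up to sys.maxunicode, and a single raw byte is built with bytes([n]). The variable names are illustrative only:

```python
# Python 3 sketch of the reverse of ord().
letter = chr(65)      # code point 65 is the letter A
euro = chr(8364)      # code point 8364 (U+20AC) is the euro sign

# A raw byte value only becomes a character once it is decoded
# with a specific encoding.
raw = bytes([164])
decoded = raw.decode("iso-8859-15")

print(letter)           # A
print(decoded == euro)  # True
```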
    

    For multi-byte encodings, a simple "integer number" mapping is usually not possible.

    Here's the same example as above, but using "iso-8859-15" and "utf-8":

    >>> s = u'€'
    >>> s.encode('iso-8859-15')
    '\xa4'
    >>> s.encode('utf-8')
    '\xe2\x82\xac'
    >>> [ord(c) for c in s.encode('iso-8859-15')]
    [164]
    >>> [ord(c) for c in s.encode('utf-8')]
    [226, 130, 172]
    

    The "utf-8" encoding uses three bytes to encode the same character, so a one-to-one mapping is not possible. Having said that, many encodings (including "utf-8") are designed to be ASCII-compatible, so a mapping is usually possible for codes in the range 0-127 (but only trivially so, because the code will always be the same).
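
    If an integer is still wanted for a multi-byte encoding, one option is to return the list of byte values, or to pack the encoded bytes into a single big-endian integer. The sketch below uses Python 3 syntax, and the helper names encoded_codes and encoded_as_int are hypothetical. Note that the packed integer is just a convention chosen here, not a standard "UTF-8 character code":

```python
def encoded_codes(char, encoding="utf-8"):
    """Return the byte values of `char` in `encoding` as a list of ints."""
    return list(char.encode(encoding))


def encoded_as_int(char, encoding="utf-8"):
    """Pack the encoded bytes of `char` into one big-endian integer."""
    return int.from_bytes(char.encode(encoding), "big")


euro = "\u20ac"  # the euro sign

print(encoded_codes(euro, "iso-8859-15"))  # [164]
print(encoded_codes(euro))                 # [226, 130, 172]
print(hex(encoded_as_int(euro)))           # 0xe282ac
print(encoded_codes("A"))                  # [65]
```

    For characters in the ASCII range both helpers give the same single value as ord, which matches the trivial 0-127 mapping described above.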
