Why is the output of print different in Python 2 and Python 3 for the same string?

再見小時候 2020-12-06 11:17

In Python 2:

$ python2 -c 'print "\x08\x04\x87\x18"' | hexdump -C
00000000  08 04 87 18 0a                                    |.....|
00000005
         


        
2 Answers
  •  天涯浪人
    2020-12-06 11:58

    Consider the following snippet of code:

    import sys
    for i in range(128, 256):
        sys.stdout.write(chr(i))
    

    Run this with Python 2 and look at the result with hexdump -C:

    00000000  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  |................|
    

    Et cetera. No surprises; 128 bytes from 0x80 to 0xff.

    Do the same with Python 3:

    00000000  c2 80 c2 81 c2 82 c2 83  c2 84 c2 85 c2 86 c2 87  |................|
    ...
    00000070  c2 b8 c2 b9 c2 ba c2 bb  c2 bc c2 bd c2 be c2 bf  |................|
    00000080  c3 80 c3 81 c3 82 c3 83  c3 84 c3 85 c3 86 c3 87  |................|
    ...
    000000f0  c3 b8 c3 b9 c3 ba c3 bb  c3 bc c3 bd c3 be c3 bf  |................|
    

    To summarize:

    • Everything from 0x80 to 0xbf has 0xc2 prepended.
    • Everything from 0xc0 to 0xff has bit 6 set to zero and has 0xc3 prepended.

    So, what’s going on here?

    In Python 2, a str is a plain sequence of bytes and no conversion is done. Tell it to write something outside the 0-127 ASCII range and it says “okey-doke!” and just writes those bytes. Simple.

    In Python 3, strings are Unicode. When non-ASCII characters are written, they must be encoded in some way. The output encoding here is UTF-8, which is the default on most systems.
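
    You can see the same thing from inside Python 3 by encoding a few of those characters yourself. This is only a minimal sketch: utf8_bytes is a helper name made up for this example, and sys.stdout.encoding may report something other than UTF-8 depending on your environment.

```python
import sys

# The encoding Python 3 applies to text written to stdout.
# It depends on the environment, so it is not guaranteed to be UTF-8.
print(sys.stdout.encoding)


def utf8_bytes(code_point):
    """Return the UTF-8 encoding of a single code point (hypothetical helper)."""
    return chr(code_point).encode("utf-8")


# 0x80..0xbf: the same byte with 0xc2 in front of it.
assert utf8_bytes(0x87) == b"\xc2\x87"

# 0xc0..0xff: bit 6 cleared, with 0xc3 in front of it.
assert utf8_bytes(0xC0) == b"\xc3\x80"
assert utf8_bytes(0xFF) == b"\xc3\xbf"

# Plain ASCII is unchanged, which is why 08, 04 and 18 come out the same.
assert utf8_bytes(0x08) == b"\x08"
```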

    So, how are these values encoded in UTF-8?

    Code points from 0x80 to 0x7ff are encoded as follows:

    110vvvvv 10vvvvvv
    

    Where the 11 v characters are the bits of the code point.

    Thus:

    0x80                 hex
    1000 0000            8-bit binary
    000 1000 0000        11-bit binary
    00010 000000         divide into vvvvv vvvvvv
    11000010 10000000    resulting UTF-8 octets in binary
    0xc2 0x80            resulting UTF-8 octets in hex
    
    0xc0                 hex
    1100 0000            8-bit binary
    000 1100 0000        11-bit binary
    00011 000000         divide into vvvvv vvvvvv
    11000011 10000000    resulting UTF-8 octets in binary
    0xc3 0x80            resulting UTF-8 octets in hex
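
    The same bit shuffling can be written out in Python 3 and checked against the built-in encoder. This is a sketch for illustration only; encode_two_byte is a made-up name and it covers just the two-byte range described above.

```python
def encode_two_byte(code_point):
    """Encode a code point in 0x80..0x7ff as two UTF-8 octets (illustrative only)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points 0x80..0x7ff use the two-byte form")
    high = 0b11000000 | (code_point >> 6)        # 110vvvvv: top 5 of the 11 bits
    low = 0b10000000 | (code_point & 0b111111)   # 10vvvvvv: bottom 6 bits
    return bytes([high, low])


# The two worked examples above.
assert encode_two_byte(0x80) == b"\xc2\x80"
assert encode_two_byte(0xC0) == b"\xc3\x80"

# Every code point in the range agrees with Python's own UTF-8 encoder.
for cp in range(0x80, 0x800):
    assert encode_two_byte(cp) == chr(cp).encode("utf-8")
```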
    

    So that’s why you’re getting a c2 before 87.

    How to avoid all this in Python 3? Use the bytes type.
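
    One way to do that, as a sketch rather than the only option: build the data as a bytes literal and write it to the binary layer under stdout, sys.stdout.buffer, instead of calling print. The write_raw helper below is a name invented for this example.

```python
import io
import sys


def write_raw(data, stream=None):
    """Write bytes unchanged to a binary stream (hypothetical helper).

    Defaults to sys.stdout.buffer, the binary layer under the text stream.
    """
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(data)
    stream.flush()


# In a bytes literal each \x escape is one raw byte, not a Unicode code point.
payload = b"\x08\x04\x87\x18\n"

# Demonstrate with an in-memory stream: nothing gets encoded on the way out.
demo = io.BytesIO()
write_raw(payload, demo)
assert demo.getvalue() == b"\x08\x04\x87\x18\n"

if __name__ == "__main__":
    # Pipe this through hexdump -C to compare with the Python 2 output.
    write_raw(payload)
```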
