Escaping unicode strings in python


In python these three commands print the same emoji:

print "\xF0\x9F\x8C\x80"
print u"\U0001F300"
print u"\uD83C\uDF00"
         


        
4 Answers
  •  温柔的废话
    2021-01-03 05:00

    Your first string is a byte string. The fact that it prints a single emoji character means that your console is configured to print UTF-8 encoded characters.
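To make the byte-string point concrete, here is a minimal sketch. It is written for Python 3 (an assumption, since the question uses Python 2 syntax), where the same four bytes have to be spelled as a `bytes` literal; the variable names are only illustrative.

```python
# Python 3 sketch: the four bytes from the question as a bytes literal.
raw = b"\xF0\x9F\x8C\x80"

# Decoding them as UTF-8 is what a UTF-8 console effectively does.
decoded = raw.decode("utf-8")

print(len(raw))            # number of bytes
print(len(decoded))        # number of code points after decoding
print(hex(ord(decoded)))   # the code point those bytes stand for
```

Four bytes go in and one code point, U+1F300, comes out, which is why the console shows a single character.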

    Your second string is a Unicode string with a single codepoint, U+1F300. The \U specifies that the next 8 hex digits should be interpreted as a codepoint.
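A small sketch of the difference between the two escapes, again assuming Python 3 (where every string literal is already Unicode). The names `wide` and `short` are made up for the illustration.

```python
# \U consumes exactly 8 hex digits, so this is one code point.
wide = "\U0001F300"

# \u consumes exactly 4 hex digits, so this is U+1F30 followed by
# an ordinary "0" character, not the emoji.
short = "\u1F300"

print(len(wide), hex(ord(wide)))
print(len(short), hex(ord(short[0])), short[1])
```

This is why a code point above U+FFFF needs either `\U` with all eight digits or some other spelling.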

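    The third string takes advantage of a quirk in the way Unicode strings are stored in Python 2. You've given two UTF-16 surrogates (a high surrogate followed by a low surrogate), which together form the single codepoint U+1F300, the same as the previous string. Each \u takes the 4 hex digits that follow it. Individually these surrogates aren't valid Unicode characters, but because Python 2 (on narrow builds) stores its Unicode strings internally as UTF-16, the pair works out. In Python 3 this no longer works: the literal is still accepted, but it produces two lone surrogates rather than one character, and they can't be encoded to UTF-8 for printing.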
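The surrogate arithmetic can be sketched as follows. This is Python 3 code under the standard UTF-16 rules; `to_surrogate_pair` is a hypothetical helper written for this illustration, not a library function.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high/low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = to_surrogate_pair(0x1F300)

# In Python 3 this is a 2-element string of lone surrogates.
pair = chr(high) + chr(low)

# Lone surrogates are rejected by the strict UTF-8 encoder.
try:
    pair.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Round-tripping through UTF-16 with "surrogatepass" joins the pair.
joined = pair.encode("utf-16", "surrogatepass").decode("utf-16")

print(hex(high), hex(low), len(pair), encodable, len(joined))
```

The round trip at the end is one way to repair such a string in Python 3 if you are handed surrogate pairs from elsewhere.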

    When you print out a Unicode string, and your console encoding is known to be UTF-8, the Unicode strings are encoded to UTF-8 bytes. Thus the 3 strings end up producing the same byte sequence on the output, generating the same character.
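To see the encoding step that `print` performs, you can stand in for the console with an in-memory stream. This is only a sketch for Python 3; `bytes_sent_to_console` is a hypothetical helper, and a real terminal's encoding comes from your environment rather than from an argument.

```python
import io

def bytes_sent_to_console(text, encoding="utf-8"):
    """Return the bytes print() would write to a stream with this encoding."""
    raw = io.BytesIO()
    console = io.TextIOWrapper(raw, encoding=encoding, newline="")
    print(text, end="", file=console)
    console.flush()
    data = raw.getvalue()
    console.detach()   # leave the BytesIO alone when the wrapper goes away
    return data

sent = bytes_sent_to_console("\U0001F300")
print(sent)
```

With a UTF-8 stream the Unicode string turns into the same four bytes as the byte string in the question.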
