Python returns length of 2 for single Unicode character string

梦谈多话 2020-12-01 19:40

In Python 2.7:

In [2]: utf8_str = '\xf0\x9f\x91\x8d'
In [3]: print(utf8_str)
👍

1 answer
  • 2020-12-01 20:07

    Your Python binary was compiled with UCS-2 support (a narrow build) and internally anything outside of the BMP (Basic Multilingual Plane) is represented using a surrogate pair.

    That means such codepoints show up as 2 characters when asking for the length.
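
    To make the surrogate pair concrete, here is a minimal sketch of the UTF-16 arithmetic that turns a codepoint above U+FFFF into two 16-bit units. The helper name to_surrogate_pair is my own and not part of Python; it only illustrates the calculation.

```python
def to_surrogate_pair(code_point):
    """Split a codepoint above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only codepoints outside the BMP have a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F44D)
print("%04X %04X" % (high, low))  # D83D DC4D
```

    Those two values are exactly the two "characters" a narrow build counts for U+1F44D.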

    If this matters, you'll have to recompile your Python binary to use UCS-4 instead (./configure --enable-unicode=ucs4 will enable it), or upgrade to Python 3.3 or newer. In 3.3, Python's Unicode support was overhauled to use a variable-width internal representation that switches between Latin-1 (which covers ASCII), UCS-2 and UCS-4, that is 1, 2 or 4 bytes per codepoint, as required by the codepoints contained in the string.

    On Python versions 2.7 and 3.0 - 3.2, you can detect what kind of build you have by inspecting the sys.maxunicode value; it'll be 2^16-1 == 65535 == 0xFFFF for a narrow UCS-2 build, 1114111 == 0x10FFFF for a wide UCS-4 build. In Python 3.3 and up it is always set to 1114111.
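
    If you would rather do that check inside a script, something along these lines should work on both Python 2 and 3. The function name unicode_build is just an illustrative choice.

```python
import sys


def unicode_build():
    """Return 'narrow' when sys.maxunicode is 0xFFFF, otherwise 'wide'."""
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"


print(unicode_build())
```

    On Python 3.3 and up this reports "wide" every time, since sys.maxunicode is fixed at 0x10FFFF there.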

    Demo:

    # Narrow build
    $ bin/python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
    65535 2 [u'\ud83d', u'\udc4d']
    # Wide build
    $ python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
    1114111 1 [u'\U0001f44d']
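
    If you are stuck on a narrow build and only need the number of codepoints, one possible workaround is to encode to UTF-32, which always uses 4 bytes per codepoint, and divide the byte length by 4. This is a sketch, not a drop-in fix: count_code_points is a name I made up, and it assumes the string contains only well-formed surrogate pairs and that the UTF-32 encoder on a narrow build merges each pair into a single codepoint.

```python
def count_code_points(text):
    """Count codepoints by encoding to UTF-32 (4 bytes per codepoint)."""
    # "utf-32-le" is used instead of "utf-32" so no 4-byte BOM is prepended.
    return len(text.encode("utf-32-le")) // 4


thumbs_up = u"\U0001f44d"
print(count_code_points(thumbs_up))
print(count_code_points(u"ok " + thumbs_up))
```

    Indexing and slicing will still operate on the individual surrogates on a narrow build, so this only helps with counting.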
    