How to iterate over Unicode characters in Python 3?

花落未央 2020-12-10 03:48

I need to step through a Python string one character at a time, but a simple "for" loop gives me UTF-16 code units instead:

str = "abc\u20ac\U00010302\

3 Answers
  •  死守一世寂寞
    2020-12-10 04:22

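    By default, Python builds before 3.3 store Unicode values internally as UCS-2 (a "narrow" build), so any code point above U+FFFF is kept as a UTF-16 surrogate pair. The UTF-16 representation of the character \U00010302 is the pair \uD800\uDF02, and a "for" loop on such a build visits each half separately. That's why you got that result.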
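To make the surrogate-pair arithmetic concrete, here is a minimal sketch of how a code point above U+FFFF is split into two UTF-16 code units. The function name `to_surrogate_pair` is made up for this illustration and is not part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10302)
print(hex(high), hex(low))  # 0xd800 0xdf02
```

The two values printed are exactly the pair mentioned above for \U00010302.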

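    That said, some Python builds use UCS-4 instead (so-called wide builds, as opposed to the narrow UCS-2 ones), but the two kinds of build are not compatible with each other. Since Python 3.3 (PEP 393) the distinction no longer exists, and a "for" loop over a str yields whole code points.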
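If you are not sure which kind of build you are running, one way to check is `sys.maxunicode`, which is 0xFFFF on a UCS-2 (narrow) build and 0x10FFFF otherwise. The helper name `build_kind` below is an assumption made for this sketch.

```python
import sys


def build_kind():
    """Return 'narrow' on a UCS-2 build and 'wide' otherwise.

    Hypothetical helper for illustration only.
    """
    # sys.maxunicode is the largest code point a single str element can hold.
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"


print(build_kind())
```

On Python 3.3 and later, `sys.maxunicode` is always 0x10FFFF, so this reports "wide" there.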

    Take a look at the documentation for Py_UNICODE:

    Py_UNICODE This type represents the storage type which is used by Python internally as basis for holding Unicode ordinals. Python’s default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. It is also possible to build a UCS4 version of Python (most recent Linux distributions come with UCS4 builds of Python). These builds then use a 32-bit type for Py_UNICODE and store Unicode data internally as UCS4. On platforms where wchar_t is available and compatible with the chosen Python Unicode build variant, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for either unsigned short (UCS2) or unsigned long (UCS4).
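For the original question, one build-independent approach is to walk the string yourself and join any surrogate pair into a single code point. This is only a sketch under that assumption: the generator name `iter_code_points` is invented here, and the sample string is a shortened version of the one in the question.

```python
def iter_code_points(text):
    """Yield the integer code point of each character in text.

    A high surrogate followed by a low surrogate is combined into one
    code point, so the result is the same whether or not the string
    stores supplementary characters as surrogate pairs.
    Hypothetical helper for illustration only.
    """
    i, n = 0, len(text)
    while i < n:
        unit = ord(text[i])
        i += 1
        if 0xD800 <= unit <= 0xDBFF and i < n:
            low = ord(text[i])
            if 0xDC00 <= low <= 0xDFFF:
                # Reverse the UTF-16 encoding of a supplementary code point.
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 1
        yield unit


sample = "abc\u20ac\U00010302"  # shortened sample, assumed for this example
print([hex(cp) for cp in iter_code_points(sample)])
# ['0x61', '0x62', '0x63', '0x20ac', '0x10302']
```

A lone surrogate that is not part of a valid pair is yielded unchanged, which keeps the loop from raising on malformed input.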
