Platform-specific Unicode semantics in Python 2.7

陌清茗 2020-12-20 03:53

Ubuntu 11.10:

$ python
Python 2.7.2+ (default, Oct  4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more


        
3 Answers
  •  南笙
    南笙 (original poster)
    2020-12-20 04:04

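    On Ubuntu, you have a "wide" Python build, where each unit of a unicode string is a full 32-bit code point (UTF-32/UCS-4). Unfortunately, this isn't (yet) available for Windows, where builds are "narrow" and characters outside the Basic Multilingual Plane are stored as UTF-16 surrogate pairs.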
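One way to check which kind of build you are running is to look at `sys.maxunicode`, which is 0xFFFF on narrow builds and 0x10FFFF on wide builds (and on Python 3.3 and later). The following is a minimal sketch; the names `is_wide_build` and `smiley` are made up for illustration.

```python
import sys

def is_wide_build():
    """Return True if sys.maxunicode covers code points above U+FFFF."""
    return sys.maxunicode > 0xFFFF

# U+1F600 lies outside the Basic Multilingual Plane, so a narrow build
# stores it as two code units (a surrogate pair), a wide build as one.
smiley = u'\U0001F600'

# The two values printed depend on the build, which is the
# platform-specific behaviour the question is about.
print(is_wide_build(), len(smiley))
```

On a narrow build, `len(smiley)` counts the two surrogates separately, so the same literal has a different length than on a wide build.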

    Windows builds will be narrow for a while, for three reasons: there have been few requests for wide characters; those requests come mostly from hard-core programmers who are able to build their own Python; and Windows itself is strongly biased towards 16-bit characters.

    Python 3.3 will have a flexible string representation (PEP 393), under which you will no longer need to care whether Unicode strings use 16-bit or 32-bit code units.

    Until then, you can get the code points from a UTF-16 (narrow-build) string with:

    import struct

    def code_points(text):
        # Encoding to UTF-32 merges surrogate pairs: 4 bytes per code point.
        utf32 = text.encode('UTF-32LE')
        return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)
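For comparison, the pairing that the UTF-32 round trip performs can also be done by hand. The sketch below is only an illustration, not part of the original answer: `combine_surrogates` is a hypothetical helper, and it assumes its input is a sequence of integer UTF-16 code units (on a narrow build, for example, `[ord(c) for c in text]`).

```python
def combine_surrogates(units):
    """Turn a sequence of UTF-16 code units into a list of code points."""
    points = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        # A high surrogate (D800-DBFF) followed by a low surrogate
        # (DC00-DFFF) together encode one code point above U+FFFF.
        if (0xD800 <= unit <= 0xDBFF and has_next
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            low = units[i + 1]
            points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP characters (and unpaired surrogates) pass through as-is.
            points.append(unit)
            i += 1
    return points

# The surrogate pair D83D DE00 encodes U+1F600.
print([hex(p) for p in combine_surrogates([0x41, 0xD83D, 0xDE00])])
```

The encode-and-unpack version is shorter and leaves the pairing to the codec; the manual version just makes the arithmetic visible.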
    
