platform specific Unicode semantics in Python 2.7

Submitted by 时光总嘲笑我的痴心妄想 on 2019-11-29 12:21:29

On Ubuntu, you have a "wide" Python build where strings are UTF-32/UCS-4. Unfortunately, this isn't (yet) available for Windows.

Windows builds will be narrow for a while, for a few reasons: there have been few requests for wide characters, those requests come mostly from hard-core programmers with the ability to build their own Python, and Windows itself is strongly biased toward 16-bit characters.

Python 3.3 will have flexible string representation, in which you will not need to care about whether Unicode strings use 16-bit or 32-bit code units.
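If you need to know which kind of build you're running on, `sys.maxunicode` distinguishes them (a standard check, shown here as a small sketch):

```python
import sys

def is_wide_build():
    # Narrow builds cap code points at 0xFFFF (one UTF-16 code unit);
    # wide builds (and every Python 3.3+ interpreter) go up to 0x10FFFF.
    return sys.maxunicode > 0xFFFF
```

On a narrow Windows build this returns `False`; on Ubuntu's wide build (and on any Python 3.3+) it returns `True`.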

Until then, you can get the code points from a UTF-16 string with

import struct

def code_points(text):
    utf32 = text.encode('UTF-32LE')
    return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)
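To see it in action (the function body is repeated so the snippet runs standalone), a character outside the BMP comes back as a single code point on either build:

```python
import struct

def code_points(text):
    utf32 = text.encode('UTF-32LE')
    return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)

# U+1F44D (THUMBS UP SIGN) is outside the BMP: one code point,
# but two UTF-16 code units on a narrow build.
print(code_points(u'\U0001F44D'))  # -> (128077,)
```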

great question! i fell down this rabbit hole recently myself.

@dan04's answer inspired me to expand it into a unicode subclass that provides consistent indexing, slicing, and len() on both narrow and wide Python 2 builds:

class WideUnicode(unicode):
  """String class with consistent indexing, slicing, len() on both narrow and wide Python."""
  def __init__(self, *args, **kwargs):
    super(WideUnicode, self).__init__(*args, **kwargs)
    # use UTF-32LE to avoid a byte order mark at the beginning of the string
    self.__utf32le = unicode(self).encode('utf-32le')

  def __len__(self):
    return len(self.__utf32le) // 4

  def __getitem__(self, key):
    length = len(self)

    if isinstance(key, int):
      if key >= length:
        raise IndexError()
      key = slice(key, key + 1)

    assert key.step is None

    # slice attributes are read-only, so compute new bounds instead of mutating
    start = 0 if key.start is None else key.start
    stop = length if key.stop is None else key.stop

    return WideUnicode(self.__utf32le[start * 4:stop * 4].decode('utf-32le'))

  def __getslice__(self, i, j):
    return self.__getitem__(slice(i, j))

open sourced here, public domain. example usage:

text = WideUnicode(obj.text)
for tag in obj.tags:
  text = WideUnicode(text[:start] + tag.text + text[end:])

(simplified from this usage.)
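For context on why a wrapper like this is needed at all: on a narrow build, indexing and len() operate on UTF-16 code units rather than code points. You can simulate the narrow view on any interpreter by decomposing a string into its UTF-16 code units (`utf16_units` is a hypothetical helper, just for illustration):

```python
def utf16_units(text):
    # Split the UTF-16-BE encoding into its 2-byte code units.
    data = text.encode('utf-16-be')
    return [data[i:i + 2] for i in range(0, len(data), 2)]

units = utf16_units(u'\U0001F44D')
print(len(units))  # -> 2: a narrow build reports len() == 2 for this single character
```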

thanks @dan04!

I primarily needed to test length accurately, hence this function, which returns the correct code point length of any unicode string whether the interpreter is a narrow or wide build. If the data uses two surrogate literals instead of a single \U-style code point on a wide build, the returned length accounts for that, as long as the surrogates are used "correctly", i.e. as a narrow build would use them.

invoke = lambda f: f()  # trick borrowed from Node.js

@invoke
def ulen():
  testlength = len(u'\U00010000')
  assert (testlength == 1) or (testlength == 2)
  if testlength == 1:  # "wide" interpreters
    def closure(data):
      u'returns the number of Unicode code points in a unicode string'
      return len(data.encode('UTF-16BE').decode('UTF-16BE'))
  else:  # "narrow" interpreters
    def filt(c):
      ordc = ord(c)
      return 0xD800 <= ordc < 0xDC00  # high (lead) surrogate
    def closure(data):
      u'returns the number of Unicode code points in a unicode string'
      return len(data) - len(filter(filt, data))
  return closure  # ulen() body is therefore different on narrow vs wide builds
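The narrow-build arithmetic can be checked in isolation: a surrogate pair contributes two code units, so subtracting one per high (lead) surrogate recovers the code point count. A standalone sketch of that logic (same idea as the `filt` branch above):

```python
def narrow_ulen(data):
    # Each code point outside the BMP appears as a high surrogate
    # (U+D800-U+DBFF) followed by a low surrogate (U+DC00-U+DFFF);
    # subtract one per high surrogate to count code points.
    return len(data) - sum(1 for c in data if 0xD800 <= ord(c) < 0xDC00)

print(narrow_ulen(u'\ud83d\udc4d'))  # -> 1
```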

Test case, passes on narrow and wide builds:

from unittest import TestCase

class TestUlen(TestCase):

  def test_ulen(self):
    self.assertEqual(ulen(u'\ud83d\udc4d'), 1)
    self.assertEqual(ulen(u'\U0001F44D'), 1)