platform specific Unicode semantics in Python 2.7

Submitted by 时光总嘲笑我的痴心妄想 on 2019-11-29 12:21:29

On Ubuntu, you have a "wide" Python build where strings are UTF-32/UCS-4. Unfortunately, this isn't (yet) available for Windows.

Windows builds will be narrow for a while, for a few reasons: there have been few requests for wide characters, those requests come mostly from hard-core programmers with the ability to build their own Python, and Windows itself is strongly biased toward 16-bit characters.

Python 3.3 will have flexible string representation, in which you will not need to care about whether Unicode strings use 16-bit or 32-bit code units.
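If you need to know which kind of build you're running on, `sys.maxunicode` distinguishes them (a standard check, shown here as a small sketch):

```python
import sys

def is_wide_build():
    # Narrow builds cap code points at 0xFFFF (one UTF-16 code unit);
    # wide builds (and every Python 3.3+ interpreter) go up to 0x10FFFF.
    return sys.maxunicode > 0xFFFF
```

On a narrow Windows build this returns `False`; on Ubuntu's wide build (and on any Python 3.3+) it returns `True`.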

Until then, you can get the code points from a UTF-16 string with

import struct

def code_points(text):
    utf32 = text.encode('UTF-32LE')
    return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)
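To see it in action (the function body is repeated so the snippet runs standalone), a character outside the BMP comes back as a single code point on either build:

```python
import struct

def code_points(text):
    utf32 = text.encode('UTF-32LE')
    return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)

# U+1F44D (THUMBS UP SIGN) is outside the BMP: one code point,
# but two UTF-16 code units on a narrow build.
print(code_points(u'\U0001F44D'))  # -> (128077,)
```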

great question! i fell down this rabbit hole recently myself.

@dan04's answer inspired me to expand it into a unicode subclass that provides consistent indexing, slicing, and len() on both narrow and wide Python 2 builds:

class WideUnicode(unicode):
  """String class with consistent indexing, slicing, len() on both narrow and wide Python."""
  def __init__(self, *args, **kwargs):
    super(WideUnicode, self).__init__(*args, **kwargs)
    # use UTF-32LE to avoid a byte order mark at the beginning of the string
    self.__utf32le = unicode(self).encode('utf-32le')

  def __len__(self):
    return len(self.__utf32le) // 4

  def __getitem__(self, key):
    length = len(self)

    if isinstance(key, int):
      if key >= length:
        raise IndexError()
      key = slice(key, key + 1)

    assert key.step is None

    # slice attributes are read-only, so compute new bounds instead of mutating
    start = 0 if key.start is None else key.start
    stop = length if key.stop is None else key.stop

    return WideUnicode(self.__utf32le[start * 4:stop * 4].decode('utf-32le'))

  def __getslice__(self, i, j):
    return self.__getitem__(slice(i, j))

open sourced here, public domain. example usage:

text = WideUnicode(obj.text)
for tag in obj.tags:
  text = WideUnicode(text[:start] + tag.text + text[end:])

(simplified from this usage.)
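For context on why a wrapper like this is needed at all: on a narrow build, indexing and len() operate on UTF-16 code units rather than code points. You can simulate the narrow view on any interpreter by decomposing a string into its UTF-16 code units (`utf16_units` is a hypothetical helper, just for illustration):

```python
def utf16_units(text):
    # Split the UTF-16-BE encoding into its 2-byte code units.
    data = text.encode('utf-16-be')
    return [data[i:i + 2] for i in range(0, len(data), 2)]

units = utf16_units(u'\U0001F44D')
print(len(units))  # -> 2: a narrow build reports len() == 2 for this single character
```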

thanks @dan04!

I primarily needed to test length accurately, hence this function, which returns the correct code point length of any unicode string whether the interpreter is a narrow or wide build. If the data uses two surrogate literals instead of a single \U-style code point on a wide build, the returned length accounts for that, as long as the surrogates are used "correctly", i.e. as a narrow build would use them.

invoke = lambda f: f()  # trick borrowed from Node.js

@invoke
def ulen():
  testlength = len(u'\U00010000')
  assert (testlength == 1) or (testlength == 2)
  if testlength == 1:  # "wide" interpreters
    def closure(data):
      u'returns the number of Unicode code points in a unicode string'
      return len(data.encode('UTF-16BE').decode('UTF-16BE'))
  else:  # "narrow" interpreters
    def filt(c):
      ordc = ord(c)
      return 0xD800 <= ordc < 0xDC00  # high (lead) surrogate
    def closure(data):
      u'returns the number of Unicode code points in a unicode string'
      return len(data) - len(filter(filt, data))
  return closure  # ulen() body is therefore different on narrow vs wide builds
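The narrow-build arithmetic can be checked in isolation: a surrogate pair contributes two code units, so subtracting one per high (lead) surrogate recovers the code point count. A standalone sketch of that logic (same idea as the `filt` branch above):

```python
def narrow_ulen(data):
    # Each code point outside the BMP appears as a high surrogate
    # (U+D800-U+DBFF) followed by a low surrogate (U+DC00-U+DFFF);
    # subtract one per high surrogate to count code points.
    return len(data) - sum(1 for c in data if 0xD800 <= ord(c) < 0xDC00)

print(narrow_ulen(u'\ud83d\udc4d'))  # -> 1
```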

Test case, passes on narrow and wide builds:

from unittest import TestCase

class TestUlen(TestCase):

  def test_ulen(self):
    self.assertEqual(ulen(u'\ud83d\udc4d'), 1)
    self.assertEqual(ulen(u'\U0001F44D'), 1)