Ubuntu 11.10:
$ python
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
My main need was to test string length accurately, hence this function, which returns the correct code point length of any unicode string whether the interpreter is a narrow or a wide build. If the data on a wide build contains two surrogate literals instead of a single \U-style code point, the returned length still counts the pair as one code point, as long as the surrogates are used "correctly", i.e. as a narrow build would use them.
invoke = lambda f: f()  # trick borrowed from Node.js

@invoke
def ulen():
    testlength = len(u'\U00010000')
    assert (testlength == 1) or (testlength == 2)
    if testlength == 1:  # "wide" interpreters
        def closure(data):
            u'returns the number of Unicode code points in a unicode string'
            # the UTF-16 round trip merges surrogate pairs into single code points
            return len(data.encode('UTF-16BE').decode('UTF-16BE'))
    else:  # "narrow" interpreters
        def filt(c):
            ordc = ord(c)
            # 55296 == 0xD800, 56320 == 0xDC00: true for high (lead) surrogates
            return (ordc >= 55296) and (ordc < 56320)
        def closure(data):
            u'returns the number of Unicode code points in a unicode string'
            # each pair occupies two items, so subtract one per lead surrogate
            return len(data) - len(filter(filt, data))
    return closure  # ulen() body is therefore different on narrow vs wide builds
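The constants 55296 and 56320 in filt are 0xD800 and 0xDC00, the bounds of the high (lead) surrogate range. As a rough illustration of why counting lead surrogates gives the number of pairs, here is how UTF-16 splits a code point above U+FFFF into two code units. The helper name to_surrogate_pair is hypothetical and is not part of ulen; this is a sketch of the standard UTF-16 arithmetic, nothing more.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not in a supplementary plane')
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits, lands in 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits, lands in 0xDC00..0xDFFF
    return high, low


high, low = to_surrogate_pair(0x1F44D)
print('%04X %04X' % (high, low))  # D83D DC4D
```

Every supplementary code point yields exactly one unit in the 0xD800..0xDBFF range, which is why subtracting the number of such units from len(data) works on a narrow build when the pairs are well formed.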
Test case for ulen; it passes on both narrow and wide builds:
from unittest import TestCase

class TestUlen(TestCase):
    def test_ulen(self):
        # a surrogate pair literal and the \U escape must both count as one
        self.assertEqual(ulen(u'\ud83d\udc4d'), 1)
        self.assertEqual(ulen(u'\U0001F44D'), 1)
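Note that ulen relies on the surrogates being well paired: on a narrow build it subtracts one for every lead surrogate, so a stray unpaired one lowers the count. If that matters for your data, a stricter variant might look like the sketch below. The name count_code_points is hypothetical, and the choice to count an unpaired surrogate as one code point is an assumption on my part, not something ulen does.

```python
def count_code_points(text):
    """Count code points, treating a well-formed surrogate pair as one.

    Assumption: an unpaired surrogate is counted as one code point.
    """
    count = 0
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        is_high = 0xD800 <= unit <= 0xDBFF
        # Only a high surrogate immediately followed by a low one forms a pair.
        if is_high and i + 1 < n and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
            i += 2
        else:
            i += 1
        count += 1
    return count
```

It walks the string once and does not depend on the build type, so the same body should behave the same on narrow builds, wide builds and Python 3, e.g. count_code_points(u'\ud83d\udc4d') and count_code_points(u'\U0001F44D') both give 1.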