Platform-specific Unicode semantics in Python 2.7

陌清茗 2020-12-20 03:53

Ubuntu 11.10:

$ python
Python 2.7.2+ (default, Oct  4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more


        
3 Answers
  •  心在旅途
    2020-12-20 04:30

    I primarily needed to test string length accurately, hence this function, which returns the correct number of code points in any unicode string whether the interpreter is a narrow or a wide build. If the data on a wide build contains a surrogate pair written as two \u-style literals instead of a single \U-style code point, the pair is still counted as one code point, as long as the surrogates are used "correctly", i.e. as a narrow build would use them (a high surrogate immediately followed by a low surrogate).

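    # call the decorated function once, at definition time (trick borrowed from Node.js)
    invoke = lambda f: f()
    
    @invoke
    def ulen():
      # len(u'\U00010000') is 1 on wide builds and 2 on narrow builds
      testlength = len(u'\U00010000')
      assert testlength in (1, 2)
      if testlength == 1:  # "wide" interpreters
        def closure(data):
          u'returns the number of Unicode code points in a unicode string'
          # the UTF-16 round trip merges surrogate pairs into single code points
          return len(data.encode('UTF-16BE').decode('UTF-16BE'))
      else:  # "narrow" interpreters
        def is_high_surrogate(c):
          # high (lead) surrogates occupy U+D800..U+DBFF (55296..56319)
          return 0xD800 <= ord(c) < 0xDC00
        def closure(data):
          u'returns the number of Unicode code points in a unicode string'
          # a well-formed pair is two code units, so subtract one per high surrogate
          return len(data) - sum(1 for c in data if is_high_surrogate(c))
      return closure  # ulen is therefore bound to a different closure on narrow vs wide builds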
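
As a side note, the build type can also be read from `sys.maxunicode` instead of measuring `len(u'\U00010000')`. The sketch below is only an illustration of that check; the helper names `is_narrow_build` and `describe_build` are made up for this example and are not part of the function above. On Python 3.3 and later it always reports a wide build.

```python
import sys

def is_narrow_build():
    # sys.maxunicode is 0xFFFF on narrow builds and 0x10FFFF on wide builds
    return sys.maxunicode == 0xFFFF

def describe_build():
    # hypothetical helper: human-readable name for the build type
    return 'narrow' if is_narrow_build() else 'wide'

print(describe_build())
```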
    

    Test case, passes on narrow and wide builds:

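    from unittest import TestCase
    
    class TestUlen(TestCase):
    
      def test_ulen(self):
        # surrogate pair written as two separate escapes
        self.assertEqual(ulen(u'\ud83d\udc4d'), 1)
        # the same character written as a single escape
        self.assertEqual(ulen(u'\U0001F44D'), 1)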
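
To make the arithmetic of the narrow-build branch easier to follow, here is a minimal sketch that works on plain lists of UTF-16 code unit integers, so it behaves the same on any interpreter. The names `count_code_points` and `combine_surrogates` are hypothetical and exist only for this illustration, and the input is assumed to be well formed (every high surrogate is followed by a low surrogate).

```python
HIGH_SURROGATE_START = 0xD800  # 55296, first high (lead) surrogate
LOW_SURROGATE_START = 0xDC00   # 56320, first low (trail) surrogate

def is_high_surrogate(unit):
    # same range test as the filter in the narrow-build branch
    return HIGH_SURROGATE_START <= unit < LOW_SURROGATE_START

def count_code_points(units):
    # each well-formed pair contributes two units but one code point,
    # so subtract one for every high surrogate seen
    return len(units) - sum(1 for u in units if is_high_surrogate(u))

def combine_surrogates(high, low):
    # standard UTF-16 decoding of a surrogate pair into a code point
    return 0x10000 + ((high - HIGH_SURROGATE_START) << 10) + (low - LOW_SURROGATE_START)

thumbs_up = [0xD83D, 0xDC4D]              # the pair used in the test case
mixed = [0x0061, 0xD83D, 0xDC4D, 0x0062]  # 'a', the same pair, 'b'

print(count_code_points(thumbs_up))
print(count_code_points(mixed))
print(hex(combine_surrogates(0xD83D, 0xDC4D)))
```

The pair `0xD83D, 0xDC4D` combines to U+1F44D, which is why both spellings in the test case are expected to give a length of 1.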
    
