Ubuntu 11.10:
$ python
Python 2.7.2+ (default, Oct  4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type \"help\", \"copyright\", \"credits\" or \"license\" for more         
On Ubuntu, you have a "wide" Python build where strings are UTF-32/UCS-4. Unfortunately, this isn't (yet) available for Windows.

Windows builds will be narrow for a while: there have been few requests for wide characters, those requests come mostly from hard-core programmers with the ability to build their own Python, and Windows itself is strongly biased towards 16-bit characters.
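A quick way to tell which kind of build you're running (a minimal check: sys.maxunicode is 0x10FFFF on wide builds and 0xFFFF on narrow builds):

import sys

if sys.maxunicode > 0xFFFF:
    print 'wide build:', len(u'\U0001F44D')    # 1: one code unit per code point
else:
    print 'narrow build:', len(u'\U0001F44D')  # 2: astral characters take a surrogate pair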
Python 3.3 will have flexible string representation, in which you will not need to care about whether Unicode strings use 16-bit or 32-bit code units.
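For reference, under that flexible representation (PEP 393), len() counts code points no matter how the interpreter was compiled:

# Python 3.3+, any build
>>> len('\U0001F44D')
1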
Until then, you can get the code points from a UTF-16 string with
import struct

def code_points(text):
    utf32 = text.encode('UTF-32LE')
    # each code point is one 32-bit little-endian unsigned int
    return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)
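For example (this works on narrow builds too, since encoding to UTF-32 joins surrogate pairs into single code points):

>>> code_points(u'\U0001F44D')
(128077,)
>>> hex(code_points(u'\U0001F44D')[0])
'0x1f44d'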
Great question! I fell down this rabbit hole recently myself.
@dan04's answer inspired me to expand it into a unicode subclass that provides consistent indexing, slicing, and len() on both narrow and wide Python 2 builds:
class WideUnicode(unicode):
  """String class with consistent indexing, slicing, len() on both narrow and wide Python."""
  def __init__(self, *args, **kwargs):
    super(WideUnicode, self).__init__(*args, **kwargs)
    # use UTF-32LE to avoid a byte order mark at the beginning of the string
    self.__utf32le = unicode(self).encode('utf-32le')

  def __len__(self):
    return len(self.__utf32le) // 4

  def __getitem__(self, key):
    length = len(self)
    if isinstance(key, (int, long)):
      if key < 0:
        key += length
      if key < 0 or key >= length:
        raise IndexError('string index out of range')
      key = slice(key, key + 1)
    # slice objects are immutable, so use indices() to fill in missing
    # start/stop values and clamp them to the string length
    start, stop, step = key.indices(length)
    assert step == 1, 'extended slicing with a step is not supported'
    return WideUnicode(self.__utf32le[start * 4:stop * 4].decode('utf-32le'))

  def __getslice__(self, i, j):
    return self.__getitem__(slice(i, j))
Open sourced here, public domain. Example usage:
text = WideUnicode(obj.text)
for tag in obj.tags:
  # start and end are code point offsets into text, defined in the surrounding code
  text = WideUnicode(text[:start] + tag.text + text[end:])
(Simplified from this usage.)

Thanks @dan04!
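A quick sanity check of the class, with assertions that should hold on narrow and wide builds alike:

s = WideUnicode(u'a\U0001F44Db')
assert len(s) == 3              # 3 code points, even on a narrow build
assert s[1] == u'\U0001F44D'    # indexing yields whole code points
assert s[1:] == u'\U0001F44Db'  # slicing never splits a surrogate pair
assert s[-1] == u'b'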
I primarily needed to test length accurately, hence this function, which correctly returns the code point length of any unicode string whether the interpreter is a narrow or wide build. If the data uses two surrogate literals instead of a single \U-style code point in a wide-built interpreter, the returned code point length will account for that, as long as the surrogates are used "correctly", i.e. as a narrow-built interpreter would use them.
invoke = lambda f: f()  # trick borrowed from Node.js

@invoke
def ulen():
  testlength = len(u'\U00010000')
  assert testlength in (1, 2)
  if testlength == 1:  # "wide" interpreters
    def closure(data):
      u'returns the number of Unicode code points in a unicode string'
      # round-tripping through UTF-16 merges any correctly paired
      # surrogate literals into single code points before counting
      return len(data.encode('UTF-16BE').decode('UTF-16BE'))
  else:  # "narrow" interpreters
    def filt(c):
      # high (lead) surrogates: U+D800 (55296) up to but not including U+DC00 (56320)
      ordc = ord(c)
      return 0xD800 <= ordc < 0xDC00
    def closure(data):
      u'returns the number of Unicode code points in a unicode string'
      # each astral code point is stored as exactly one high surrogate plus
      # one low surrogate, so subtracting the high surrogates gives the count
      return len(data) - len(filter(filt, data))
  return closure  # ulen() body is therefore different on narrow vs wide builds
Test case, passes on narrow and wide builds:
from unittest import TestCase

class TestUlen(TestCase):
  def test_ulen(self):
    self.assertEqual(ulen(u'\ud83d\udc4d'), 1)
    self.assertEqual(ulen(u'\U0001F44D'), 1)
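Usage, compared with the built-in len():

s = u'\U0001F44D'  # one code point outside the BMP
print len(s)       # 2 on narrow builds, 1 on wide builds
print ulen(s)      # 1 on both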