How do I convert a unicode to a string at the Python level?


The following unicode and string can exist on their own if defined explicitly:

>>> value_str='Andr\xc3\xa9'
>>> value_uni=u'Andr\xc3\xa9'


                      


      
      
        
7 answers

        
                         				            
            
           
            
                              
                
              
              
                
闹比i · 2020-12-09 18:20
              
            
            
                                                                       
Simplified explanation: in Python 2, the str type is a byte string, so each element can only hold a value in the 0-255 range. To store unicode (which can contain characters from a much wider range) in a str, you first have to encode the unicode to a format suitable for str, for example UTF-8.

To do this, call the encode method on your unicode object and pass the desired encoding as the argument, for example this_is_str = value_uni.encode('utf-8').
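As a rough illustration of the encode step, here is a minimal sketch. It assumes Python 3, where the type this thread calls unicode is named str and the byte string is named bytes; the variable name this_is_bytes is only illustrative.

```python
# Python 3 sketch: `str` plays the role of Python 2's `unicode`,
# and `bytes` plays the role of Python 2's `str`.
value_uni = 'Andr\xe9'                      # text: 5 code points, the last is U+00E9

this_is_bytes = value_uni.encode('utf-8')   # encode the text to UTF-8 bytes

print(len(value_uni))       # number of code points
print(len(this_is_bytes))   # number of bytes; U+00E9 takes two bytes in UTF-8
print(this_is_bytes)        # b'Andr\xc3\xa9'
```

The byte count is larger than the character count because any code point above 127 needs more than one byte in UTF-8.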

You can read a longer, more in-depth (and language-agnostic) article on Unicode handling here: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Another excellent article (this time Python-specific): Unicode HOWTO
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
时光取名叫无心 · 2020-12-09 18:25
              
            
            
                                                                       
Use value_uni.encode('utf8'), or whatever encoding you need.

See http://docs.python.org/library/stdtypes.html#str.encode
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
傲寒 · 2020-12-09 18:29
              
            
            
                                                                       
It seems like

str(value_uni)


should work... at least, it did when I tried it.

EDIT: Turns out that this only works because my system's default encoding is, as far as I can tell, ISO-8859-1 (Latin-1).  So for a platform-independent version of this, try

value_uni.encode('latin1')
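To make the dependence on the default encoding concrete, here is a small sketch. It assumes Python 3, and it uses an explicit ASCII encode as a stand-in for what str(value_uni) would attempt on a Python 2 system whose default encoding is ASCII; the names ascii_ok and latin1_bytes are illustrative only.

```python
value_uni = 'Andr\xc3\xa9'   # the mis-decoded text from the question

# Stand-in for an implicit conversion with an ASCII default encoding:
# code points above 127 cannot be represented, so the encode fails.
try:
    value_uni.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Latin-1 maps code points 0-255 one-to-one onto byte values 0-255,
# so naming it explicitly behaves the same on every platform.
latin1_bytes = value_uni.encode('latin-1')

print(ascii_ok)        # False
print(latin1_bytes)    # b'Andr\xc3\xa9'
```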

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
抹茶落季 · 2020-12-09 18:32
              
            
            
                                                                       
If you have u'Andr\xc3\xa9', that is a Unicode string that was decoded from a byte string with the wrong encoding.  The correct encoding is UTF-8.  To convert it back to a byte string so you can decode it correctly, you can use the trick you discovered.  The first 256 code points of Unicode are a 1:1 mapping with ISO-8859-1 (alias latin1) encoding.  So:

>>> u'Andr\xc3\xa9'.encode('latin1')
'Andr\xc3\xa9'


Now it is a byte string that can be decoded correctly with utf8:

>>> 'Andr\xc3\xa9'.decode('utf8')
u'Andr\xe9'
>>> print 'Andr\xc3\xa9'.decode('utf8')
André


In one step:

>>> print u'Andr\xc3\xa9'.encode('latin1').decode('utf8')
André
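If the same repair is needed in several places, the round trip can be wrapped in a helper. The sketch below assumes Python 3 (so the steps are str to bytes to str), and fix_mojibake is a hypothetical name rather than a library function. It returns the input unchanged when the round trip does not apply.

```python
def fix_mojibake(text):
    """Undo a 'UTF-8 bytes read as Latin-1' mix-up; return text unchanged otherwise."""
    try:
        # Step 1: recover the original bytes. Step 2: decode them as UTF-8.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has code points above 255, or the recovered
        # bytes are not valid UTF-8, so this repair does not apply.
        return text

repaired = fix_mojibake('Andr\xc3\xa9')     # mis-decoded input
already_ok = fix_mojibake('Andr\xe9')       # correct input is left alone
non_latin = fix_mojibake('\u4e2d\u6587')    # cannot be Latin-1, left alone

print(ascii(repaired))      # 'Andr\xe9'
print(ascii(already_ok))    # 'Andr\xe9'
print(ascii(non_latin))     # '\u4e2d\u6587'
```

The fallback is a design choice for this sketch: a correctly decoded string is unlikely to survive the round trip, so failures are treated as "nothing to fix" rather than as errors.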

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
予麋鹿 · 2020-12-09 18:36
              
            
            
                                                                       
The OP is converting to neither ASCII nor UTF-8, which is why the encode calls suggested for those encodings won't work. Try this:

v = u'Andr\xc3\xa9'
s = ''.join(map(lambda x: chr(ord(x)),v))


The chr(ord(x)) business takes the numeric value of each unicode character (which had better fit in one byte for your application) and turns it into a one-byte string, and the ''.join call is an idiom that concatenates that list of one-character strings back into an ordinary string. No doubt there is a more elegant way.
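One candidate for the more elegant way is the Latin-1 codec, which performs the same code-point-to-byte mapping. The sketch below assumes Python 3, where the manual version is written with bytes() instead of chr() and ''.join; the names s_manual, s_latin1 and fits are illustrative.

```python
v = 'Andr\xc3\xa9'

# Python 3 counterpart of the chr(ord(x)) loop:
# build a bytes object directly from the integer code points.
s_manual = bytes(ord(ch) for ch in v)

# The same result via the Latin-1 codec.
s_latin1 = v.encode('latin-1')

print(s_manual == s_latin1)   # True

# A code point above 255 does not fit in one byte, so bytes() rejects it.
try:
    bytes(ord(ch) for ch in '\u20ac')
    fits = True
except ValueError:
    fits = False

print(fits)                   # False
```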
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
灰色年华 · 2020-12-09 18:39
              
            
            
                                                                       
You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9', which is the text André.

But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'


Then decode it correctly:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'


Now it is in the correct format.

However, rather than patching the string afterwards, you should, if possible, work out where the data was incorrectly decoded in the first place and fix the problem there.
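For tracking down the root cause, it can help to reproduce how such a string comes about. The sketch below assumes Python 3 and assumes the data originally arrived as UTF-8 bytes (the variable raw stands in for whatever a file, socket, or database driver actually returned).

```python
raw = b'Andr\xc3\xa9'            # UTF-8 bytes, as they might arrive from outside

wrong = raw.decode('latin-1')    # wrong codec: each byte becomes its own character
right = raw.decode('utf-8')      # right codec: the two bytes form one character

print(ascii(wrong))   # 'Andr\xc3\xa9'  (the value from the question)
print(ascii(right))   # 'Andr\xe9'
print(len(wrong), len(right))    # 6 5
```

If the wrong value matches what your program sees, the fix belongs at the point where the bytes are first decoded, by passing the correct encoding there.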
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
   
          