How to decode unicode in a Chinese text

情书的邮戳 2021-01-01 05:49

with open('result.txt', 'r') as f:
    data = f.read()

print 'What type is my data:'
print type(data)

for i in data:
    print "what is i:"
    print i
    pri


      
      
        
4 answers

[愿得一人] (OP) 2021-01-01 06:40
              

            
            
                        
When you call encode on a str with most (all?) codecs, Python implicitly decodes it as ASCII first, then encodes the result with the codec you specified. Calling encode on a str makes little sense in the first place: in Python 2, str is a byte-oriented type, not a true text type like unicode, which is what actually needs encoding. If you want the str to be interpreted as something other than ASCII, you need to decode it from bytes-like str to true text unicode yourself.

When you do i.encode('utf-8') when i is a str, you're implicitly saying i is logically text (represented by bytes in the default encoding), not binary data. So in order to encode it, Python first needs to decode it to determine what the "logical" text is. Your input is probably encoded in some ASCII superset (e.g. latin-1, or even utf-8) and contains non-ASCII bytes; Python tries to decode them using the ascii codec (to figure out the true Unicode ordinals it needs to encode as utf-8), and fails.
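
The failure described above can be reproduced without the original file. This is a minimal sketch, assuming the input is UTF-8; the sample bytes are illustrative and not taken from result.txt. It performs the hidden ASCII decode step explicitly, so it should run on both Python 2.7 and Python 3:

```python
# Illustrative bytes: the UTF-8 encoding of two Chinese characters
# (U+4E2D U+6587). The real contents of result.txt are unknown.
raw = b'\xe4\xb8\xad\xe6\x96\x87'

# On Python 2, str.encode('utf-8') first does the equivalent of this decode.
try:
    raw.decode('ascii')
    implicit_step_failed = False
except UnicodeDecodeError as exc:
    implicit_step_failed = True
    print('implicit ASCII decode fails: %s' % exc)

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode('utf-8')
print('%d bytes, %d characters' % (len(raw), len(text)))
```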

You need to do one of:


1. Explicitly decode the str you read using the correct codec (to get a unicode object), then encode that back to utf-8.
2. Let Python do the work from #1 for you implicitly. Instead of using open, import io and use io.open (Python 2.7+ only; on Python 3+, io.open and open are the same function), which gets you an open that works like Python 3's open. You can pass this open an encoding argument (e.g. io.open('/path/to/file', 'r', encoding='latin-1')), and reading from the resulting file object will give you already decoded unicode objects (which can then be encoded to whatever you like).
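
Both options can be sketched side by side. The snippet below writes its own throwaway sample file, since the real result.txt is not available; the temporary path, the sample text, and the choice of UTF-8 as the file's encoding are all assumptions for illustration:

```python
import io
import os
import tempfile

# Hypothetical stand-in for result.txt; UTF-8 is an assumed encoding.
path = os.path.join(tempfile.mkdtemp(), 'result.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\u4e2d\u6587')

# Option 1: read raw bytes, decode the whole buffer once, then re-encode.
with io.open(path, 'rb') as f:
    raw = f.read()
text_from_bytes = raw.decode('utf-8')
utf8_bytes = text_from_bytes.encode('utf-8')

# Option 2: let io.open decode while reading.
with io.open(path, 'r', encoding='utf-8') as f:
    text_from_file = f.read()

# Iterating now yields whole characters, not individual bytes.
for ch in text_from_file:
    print(repr(ch))
```

If the file is really in another encoding, the same code should apply with that codec name substituted for 'utf-8' in the read steps.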


Note: #1 will not work if the real encoding is something like utf-8 and you defer the work until you're iterating character by character. For non-ASCII characters, utf-8 is multibyte, so if you only have one byte, you can't decode (because the following bytes are needed to calculate a single ordinal). This is a reason to prefer using io.open to read as unicode natively so you're not worrying about stuff like this.
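
The multibyte problem in the note can be seen directly. This is a small sketch using one illustrative character (U+4E2D, three bytes in UTF-8); slicing is used instead of plain iteration so the loop behaves the same on Python 2.7 and Python 3:

```python
raw = b'\xe4\xb8\xad'  # UTF-8 bytes of a single character, U+4E2D

failures = 0
for i in range(len(raw)):
    single = raw[i:i + 1]  # a 1-byte string on both Python 2 and 3
    try:
        single.decode('utf-8')
    except UnicodeDecodeError:
        # One byte of a multibyte sequence is not decodable on its own.
        failures += 1

print('%d of %d single bytes failed to decode' % (failures, len(raw)))

# Decoding the complete sequence at once works.
whole = raw.decode('utf-8')
```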
    
             
                                                        
            
            
              
                