Help me understand why Unicode only works sometimes with Python


别跟我提以往 2020-12-14 22:11

Here's a little program:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

print('abcd kΩ ☠ °C √Hz µF ü ☃ ♥')
print(u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥')


      
      
        
5 Answers

一个人的身影 (original poster) 2020-12-14 23:05

I/O in Python (and most other languages) is based on bytes.  When you write a byte string (str in 2.x, bytes in 3.x) to a file, the bytes are simply written as-is.  When you write a Unicode string (unicode in 2.x, str in 3.x) to a file, the data needs to be encoded to a byte sequence.

For a further explanation of this distinction see the Dive into Python 3 chapter on strings.
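
To make the bytes-versus-text distinction concrete, here is a minimal Python 3 sketch. The sample string and the temporary file name are made up for illustration; the point is only that writing bytes stores them as-is, while writing text goes through an encoding step first.

```python
import os
import tempfile

text = 'kΩ ☃'                  # text string (str in 3.x, unicode in 2.x)
data = text.encode('utf-8')    # byte string (bytes in 3.x, str in 2.x)

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Binary mode: the bytes are written to disk unchanged.
with open(path, 'wb') as f:
    f.write(data)

# Text mode: the file object encodes the text with the encoding we name.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Either way, what ends up on disk is the same byte sequence.
with open(path, 'rb') as f:
    on_disk = f.read()

print(on_disk == data)  # True
```

The console is just another byte-oriented file, except that you usually don't get to pick its encoding explicitly.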

print('abcd kΩ ☠ °C √Hz µF ü ☃ ♥')


Here, in Python 2, the string is a byte string.  Because the encoding of your source file is UTF-8, the bytes are:

'abcd k\xce\xa9 \xe2\x98\xa0 \xc2\xb0C \xe2\x88\x9aHz \xc2\xb5F \xc3\xbc \xe2\x98\x83 \xe2\x99\xa5'


The print statement writes these bytes to the console as-is.  But the Windows console interprets byte strings as being encoded in the "OEM" code page, which in the US is 437.  So the string you actually see on your screen is 

abcd k╬⌐ Γÿá ┬░C ΓêÜHz ┬╡F ├╝ Γÿâ ΓÖÑ
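
You can reproduce that garbage without a Windows console. This is a small sketch that assumes nothing beyond Python's built-in 'cp437' codec: encode the text as UTF-8 (what the source file contains), then decode those bytes as code page 437 (what the console does).

```python
# -*- coding: utf-8 -*-
text = u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥'

# What is stored in the UTF-8 source file and sent to the console.
utf8_bytes = text.encode('utf-8')

# What a console using the OEM code page 437 makes of those bytes.
mojibake = utf8_bytes.decode('cp437')

# Compare against the garbled line shown above.
print(mojibake == u'abcd k╬⌐ Γÿá ┬░C ΓêÜHz ┬╡F ├╝ Γÿâ ΓÖÑ')  # True
```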


On your Ubuntu system, this doesn't cause a problem because there the default console encoding is UTF-8, so you don't have the discrepancy between source file encoding and console encoding.

print(u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥')


When printing a Unicode string, the string has to get encoded into bytes.  But it only works if you have an encoding that supports those characters.  And you don't.


The default IBM437 encoding lacks the characters ☠☃♥.
The windows-1252 encoding used by Spyder lacks the characters Ω☠√☃♥.


So, in both cases, you get a UnicodeEncodeError trying to print the string.
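
If you want to check this yourself, something like the following should do. The helper name `unencodable` is made up for this sketch; it simply tries to encode each character with Python's built-in codec and collects the ones that fail.

```python
# -*- coding: utf-8 -*-

def unencodable(text, encoding):
    """Return the characters of `text` that `encoding` cannot represent.

    Hypothetical helper for illustration, not part of any library.
    """
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

text = u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥'

missing_437 = unencodable(text, 'cp437')    # the OEM console code page
missing_1252 = unencodable(text, 'cp1252')  # windows-1252

print(len(missing_437), len(missing_1252))  # 3 5
```

A single unencodable character is enough for print to raise UnicodeEncodeError on the whole string.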


  What gives?


Windows and Linux took vastly different approaches to supporting Unicode.

Originally, they both worked pretty much the same way: Each locale has its own language-specific char-based encoding (the "ANSI code page" in Windows).  Western languages used ISO-8859-1 or windows-1252, Russian used KOI8-R or windows-1251, etc.

When Windows NT added support for Unicode (in the early days when it was assumed that Unicode would use 16-bit characters), it did so by creating a parallel version of its API that used wchar_t instead of char.  For example, the MessageBox function was split into the two functions:

int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);


The "W" functions are the "real" ones.  The "A" functions exist for backwards compatibility with DOS-based Windows and mostly just convert their string arguments to UTF-16 and then call the corresponding "W" function.
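
Conceptually, the conversion an "A" function performs looks roughly like this. It is only a Python illustration of the idea, not how Windows implements it: the function name `ansi_to_wide` is invented here, and windows-1252 is assumed as the ANSI code page.

```python
def ansi_to_wide(ansi_bytes, code_page='cp1252'):
    """Sketch of the A-to-W step: ANSI code page bytes -> UTF-16 bytes.

    Hypothetical helper; 'cp1252' is an assumed ANSI code page.
    """
    text = ansi_bytes.decode(code_page)   # interpret bytes in the ANSI code page
    return text.encode('utf-16-le')       # 16-bit code units, little-endian

wide = ansi_to_wide(b'caf\xe9')   # "café" in windows-1252
print(wide)  # b'c\x00a\x00f\x00\xe9\x00'
```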

In the Unix world (specifically, Plan 9), writing a whole new version of the POSIX API was seen as impractical, so Unicode support was approached in a different manner.  The existing support for multi-byte encoding in CJK locales was used to implement a new encoding now known as UTF-8.
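
A quick sketch of why that approach could keep the existing char-based API: UTF-8 leaves ASCII bytes untouched and never produces a zero byte for other characters, whereas UTF-16 puts zero bytes into ordinary text, which NUL-terminated char strings cannot carry. The sample string is arbitrary.

```python
# -*- coding: utf-8 -*-
text = u'kΩ'

utf8 = text.encode('utf-8')
utf16 = text.encode('utf-16-le')

print(utf8)    # b'k\xce\xa9'      ASCII 'k' stays one byte, Ω takes two
print(utf16)   # b'k\x00\xa9\x03'  every character takes two bytes here

# Pure ASCII text is byte-for-byte identical in UTF-8.
print(u'abcd'.encode('utf-8') == b'abcd')   # True

# Only the UTF-16 form contains NUL bytes.
print(b'\x00' in utf8, b'\x00' in utf16)    # False True
```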

The preference towards UTF-8 on Unix-like systems and UTF-16 on Windows is a huge pain in the ass when writing cross-platform code that supports Unicode.  Python tries to hide this from the programmer, but printing to the console is one of Joel's "leaky abstractions".
    
             
                                                        
            
            
              
                