Properly print UTF-8 characters in the Windows console

慢半拍i 2020-11-29 04:34

This is the way I try to do it:

#include <iostream>
#include <windows.h>
using namespace std;

int main() {
  SetConsoleOutputCP(CP_UTF8);
   //ge


      
      
        
7 Answers

情话喂你 (OP) 2020-11-29 05:18

The console can be set to display UTF-8 characters: as @vladasimovic answers, SetConsoleOutputCP(CP_UTF8) does that. Alternatively, you can prepare the console with the command chcp 65001, or run it from the program with system("chcp 65001 > nul"). Don't forget to save the source code in UTF-8 as well.
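A minimal sketch of the SetConsoleOutputCP route, assuming you also want the previous code page put back when the program is done. The class name Utf8ConsoleScope is made up for illustration; GetConsoleOutputCP and SetConsoleOutputCP are the real WinAPI calls. Outside Windows the class does nothing and simply assumes the terminal already expects UTF-8.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: switches the console output code page to UTF-8
// while the object is alive and restores the previous one afterwards.
class Utf8ConsoleScope {
public:
  Utf8ConsoleScope() {
#ifdef _WIN32
    previous_ = GetConsoleOutputCP();            // 0 if no console is attached
    active_ = SetConsoleOutputCP(CP_UTF8) != 0;  // nonzero means success
#else
    active_ = true;  // assumption: non-Windows terminals already use UTF-8
#endif
  }

  ~Utf8ConsoleScope() {
#ifdef _WIN32
    if (active_ && previous_ != 0) SetConsoleOutputCP(previous_);
#endif
  }

  Utf8ConsoleScope(const Utf8ConsoleScope&) = delete;
  Utf8ConsoleScope& operator=(const Utf8ConsoleScope&) = delete;

  // True if the console accepted the UTF-8 code page.
  bool active() const { return active_; }

private:
#ifdef _WIN32
  UINT previous_ = 0;
#endif
  bool active_ = false;
};
```

Declaring one instance at the top of main() keeps UTF-8 output active until main returns. The font caveat still applies: the code page only controls decoding, not which glyphs the console font can draw.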

To check the UTF-8 support, run

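#include <stdio.h>
#include <windows.h>

// Callback invoked once per code page; cp holds the code page number as text.
BOOL CALLBACK showCPs(LPSTR cp) {
  puts(cp);
  return TRUE; // keep enumerating
}

int main() {
  // The ANSI variant is called explicitly so that puts() receives a char
  // string even when UNICODE is defined.
  EnumSystemCodePagesA(showCPs, CP_SUPPORTED);
}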


65001 should appear in the list.

The Windows console uses OEM code pages by default, and most default raster fonts support only national characters. Windows XP and newer also support TrueType fonts, which should display the missing characters (@Devenec suggests Lucida Console in his answer).

Why printf fails

As @bames53 points out in his answer, the Windows console is not a stream device: all bytes of a multibyte character have to be written together. Sometimes printf breaks this by passing the bytes to the output one at a time. Try formatting with sprintf and then writing the result with puts, or make sure the output buffer is flushed (fflush) only once the whole string has accumulated.
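The "format first, write once" advice could look roughly like this. The function names format_line and write_whole are invented for the example, and the message layout is arbitrary; the point is only that the complete byte sequence is built in memory and then passed to the stream in a single call followed by one flush.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format into memory first, so every multibyte
// sequence is complete before anything reaches the console.
std::string format_line(const char* name, int value) {
  char buf[256];
  int n = std::snprintf(buf, sizeof buf, "%s = %d\n", name, value);
  if (n < 0) return std::string();  // encoding error reported by snprintf
  std::size_t len = static_cast<std::size_t>(n);
  if (len >= sizeof buf) len = sizeof buf - 1;  // output was truncated
  return std::string(buf, len);
}

// Hypothetical helper: pass the whole buffer to the stream in one call,
// then flush once. Returns true if every byte was accepted.
bool write_whole(std::FILE* out, const std::string& text) {
  std::size_t written = std::fwrite(text.data(), 1, text.size(), out);
  return written == text.size() && std::fflush(out) == 0;
}
```

A call such as write_whole(stdout, format_line("délka", 8)) then sends the line in one piece. Note that truncating at a fixed buffer size can still cut a multibyte character in half, so a real program would size the buffer from the snprintf return value.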

If everything fails

Note the UTF-8 format: one character is encoded as 1-4 bytes in standard UTF-8 (RFC 3629). Use this function to shift to the next character in the string:

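const char* ucshift(const char* str, int len=1) {
  // The loop body is cut off in the source; this is one plausible reconstruction.
  for(int i=0; i<len; ++i) {
    if(*str==0) return str;            // never step past the terminator
    ++str;                             // skip the lead byte
    while((*str&0xC0)==0x80) ++str;    // skip continuation bytes (10xxxxxx)
  }
  return str;
}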


...and this function to turn the bytes into a Unicode code point:

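// Assumes str points at the first byte of a well-formed UTF-8 sequence.
int ucchar(const char* str) {
  if(!(*str&128)) return *str;                 // ASCII: a single byte
  unsigned char c = *str, bytes = 0;
  while((c<<=1)&128) ++bytes;                  // count the continuation bytes
  int result = 0;
  // collect the low 6 bits of each continuation byte, last byte first
  for(int i=bytes; i>0; --i) result|= (*(str+i)&63)<<(6*(bytes-i));
  int mask = 1;
  for(int i=bytes; i<6; ++i) mask<<= 1, mask|= 1;  // payload bits of the lead byte
  result|= (*str&mask)<<(6*bytes);
  return result;
}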
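The decoder above trusts its input. If the bytes may be truncated or malformed, a checked variant along the following lines might be safer. This is only a sketch, not part of the original answer: the names Decoded and decode_utf8 are made up, and it assumes C++14 or later so that the function can be constexpr.

```cpp
#include <cstddef>

// Hypothetical result type: the code point (or -1 on error) and the
// number of bytes consumed (1 on error, so a caller can resynchronise).
struct Decoded {
  long code_point;
  std::size_t length;
};

// Decode one UTF-8 sequence from s, reading at most avail bytes.
constexpr Decoded decode_utf8(const char* s, std::size_t avail) {
  if (avail == 0) return {-1, 0};
  const unsigned char lead = static_cast<unsigned char>(s[0]);
  if (lead < 0x80) return {lead, 1};  // ASCII

  std::size_t len = 0;
  long cp = 0;
  if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
  else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
  else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
  else return {-1, 1};            // stray continuation or invalid lead byte

  if (avail < len) return {-1, 1};  // sequence is cut off

  for (std::size_t i = 1; i < len; ++i) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if ((c & 0xC0) != 0x80) return {-1, 1};  // not a continuation byte
    cp = (cp << 6) | (c & 0x3F);
  }

  // Reject overlong encodings, UTF-16 surrogates and out-of-range values.
  const long min_cp = (len == 2) ? 0x80 : (len == 3) ? 0x800 : 0x10000;
  if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return {-1, 1};

  return {cp, len};
}
```

Returning the consumed length together with the code point means one call can replace both the shift and the decode step when walking a string.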


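Then you can try a Windows-specific (non-standard) WinAPI function like MultiByteToWideChar. It takes the source code page as an explicit argument, so it does not depend on setlocale(); call setlocale() first only if you then print the result through the C runtime's wide-character functions such as wprintf.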
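As a rough sketch of the MultiByteToWideChar route: convert the UTF-8 bytes to UTF-16 and pass them to WriteConsoleW, falling back to a plain byte write when the output is not a console. The function name print_utf8 is invented here, and the fallback behaviour is an assumption about what a program would want; the WinAPI calls themselves are the documented ones.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: print UTF-8 text, using the wide console API on
// Windows. Returns true if all of the text was accepted.
bool print_utf8(const std::string& utf8) {
  if (utf8.empty()) return true;
#ifdef _WIN32
  std::fflush(stdout);  // keep ordering with earlier stdio output
  const int bytes = static_cast<int>(utf8.size());
  // First call: ask how many UTF-16 code units are needed.
  const int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), bytes, nullptr, 0);
  if (wlen > 0) {
    std::wstring wide(static_cast<std::size_t>(wlen), L'\0');
    // Second call: do the actual conversion.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), bytes, &wide[0], wlen);
    DWORD written = 0;
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    // WriteConsoleW fails when stdout is redirected to a file or pipe;
    // in that case fall through to the byte-oriented path below.
    if (WriteConsoleW(out, wide.data(), static_cast<DWORD>(wlen), &written, nullptr))
      return written == static_cast<DWORD>(wlen);
  }
#endif
  // Fallback (and the non-Windows path): write the raw UTF-8 bytes.
  const std::size_t n = std::fwrite(utf8.data(), 1, utf8.size(), stdout);
  std::fflush(stdout);
  return n == utf8.size();
}
```

Because WriteConsoleW takes UTF-16 directly, this path does not depend on the active console code page, only on the console font having the glyphs.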

Alternatively, you can write your own mapping from Unicode code points to your active code page. Example:

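#include <stdio.h>
#include <stdlib.h>

// uses ucshift() and ucchar() defined above
int main() {
  system("chcp 65001 > nul");
  char str[] = "příšerně"; // file saved in UTF-8
  for(const char* p=str; *p!=0; p=ucshift(p)) {
    int c = ucchar(p);
    if(c<128) printf("%c\n",c);   // ASCII: print the character itself
    else printf("%d\n",c);        // otherwise print the code point number
  }
}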


This should print

p
345
237
353
e
r
n
283


If your code page doesn't support these Czech diacritics, you could map 345=>r, 237=>i, 353=>s, 283=>e. There are at least 5(!) different charsets just for Czech. Displaying readable characters across different Windows locales is a horror.
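The mapping just described could be written as a small lookup function. This is only an illustration covering the four code points from the example; the name ascii_fallback and the '?' placeholder for unmapped characters are assumptions, and a real table would need many more entries.

```cpp
// Hypothetical helper: map a Unicode code point to an ASCII stand-in.
// Only the four Czech letters from the example are covered.
constexpr char ascii_fallback(int cp) {
  if (cp >= 0 && cp < 128) return static_cast<char>(cp);  // already ASCII
  switch (cp) {
    case 345: return 'r';  // ř
    case 237: return 'i';  // í
    case 353: return 's';  // š
    case 283: return 'e';  // ě
    default:  return '?';  // no mapping known
  }
}
```

Passing each code point of příšerně through this function yields priserne, which any code page can display.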
    
             
                                                        
            

            
              
                