Properly print UTF-8 characters in the Windows console

慢半拍i 2020-11-29 04:34

This is the way I try to do it:

#include <iostream>
#include <windows.h>
using namespace std;

int main() {
  SetConsoleOutputCP(CP_UTF8);
   //ge         


        
7 answers
  •  情话喂你
    2020-11-29 05:18

    The console can be set to display UTF-8 characters: as @vladasimovic answers, SetConsoleOutputCP(CP_UTF8) can be used for that. Alternatively, you can prepare the console with the command chcp 65001, or with the call system("chcp 65001 > nul") in the main program. Don't forget to save the source code in UTF-8 as well.
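
    As a minimal sketch of the SetConsoleOutputCP approach, the helper below switches the console output code page to UTF-8 and puts the previous one back when it goes out of scope. The class name Utf8ConsoleGuard is hypothetical and not part of any answer here; outside Windows the guard does nothing, on the assumption that the terminal already expects UTF-8.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: switches the console output code page to UTF-8
// and restores the previous code page on destruction.
class Utf8ConsoleGuard {
public:
    Utf8ConsoleGuard() {
#ifdef _WIN32
        previous_ = GetConsoleOutputCP();        // returns 0 on failure
        ok_ = SetConsoleOutputCP(CP_UTF8) != 0;  // CP_UTF8 == 65001
#else
        ok_ = true;  // assumption: non-Windows terminals already use UTF-8
#endif
    }

    ~Utf8ConsoleGuard() {
#ifdef _WIN32
        if (ok_ && previous_ != 0) SetConsoleOutputCP(previous_);
#endif
    }

    // True if the switch to UTF-8 succeeded (always true outside Windows).
    bool ok() const { return ok_; }

    Utf8ConsoleGuard(const Utf8ConsoleGuard&) = delete;
    Utf8ConsoleGuard& operator=(const Utf8ConsoleGuard&) = delete;

private:
    bool ok_ = false;
#ifdef _WIN32
    UINT previous_ = 0;
#endif
};
```

    A typical use would be to declare Utf8ConsoleGuard guard; at the top of main() before anything is printed. Restoring the old code page is a design choice: it keeps the surrounding shell session in the state the user left it in.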

    To check the UTF-8 support, run

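    #include <stdio.h>
    #include <windows.h>
    
    // Called once per code page; cp is the code page number as text.
    BOOL CALLBACK showCPs(LPSTR cp) {
      puts(cp);
      return TRUE; // continue the enumeration
    }
    
    int main() {
      EnumSystemCodePagesA(showCPs, CP_SUPPORTED);
    }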
    

    65001 should appear in the list.

    The Windows console uses OEM code pages by default, and most default raster fonts support only national characters. Windows XP and newer also support TrueType fonts, which should display the missing characters (@Devenec suggests Lucida Console in his answer).

    Why printf fails

    As @bames53 points out in his answer, the Windows console is not a stream device: all bytes of a multibyte character have to be written together. printf sometimes breaks this by passing the bytes to the output one at a time. Try formatting with sprintf and then writing the result with puts, or make sure the output buffer is flushed with fflush only once the whole character has accumulated.
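
    One way to follow this advice is to hand the complete UTF-8 byte sequence to the C runtime in a single call and flush right away. The sketch below does that with fwrite and fflush. The name write_utf8 is hypothetical, and whether this is enough depends on the C runtime in use, so treat it as something to try rather than a guaranteed fix.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: passes the whole UTF-8 string to the stream in one
// fwrite call and flushes immediately afterwards. If all output goes through
// this function and the text fits in the stream buffer, the bytes of a
// multibyte character are not spread over separate flushes.
// Returns the number of bytes accepted by fwrite.
std::size_t write_utf8(std::FILE* out, const std::string& text) {
    std::size_t written = std::fwrite(text.data(), 1, text.size(), out);
    std::fflush(out);
    return written;
}
```

    For example, write_utf8(stdout, "p\xC5\x99\n") sends the four bytes of "př" plus a newline in one go. The escapes are used here only so that the example does not depend on the encoding of the source file.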

    If everything fails

    Note the UTF-8 format: one character is encoded as 1 to 4 bytes (the original design allowed up to 6). Use this function to shift to the next character in the string:

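    const char* ucshift(const char* str, int len=1) {
      for(int i=0; i<len; ++i) {
        if(*str==0) return str;              // stop at the terminating NUL
        ++str;                               // step past the lead byte
        while((*str&0xC0)==0x80) ++str;      // skip continuation bytes (10xxxxxx)
      }
      return str;
    }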

    ...and this function to transform the bytes into unicode number:

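    int ucchar(const char* str) {
      if(!(*str&128)) return *str;  // plain ASCII: one byte
      unsigned char c = *str, bytes = 0;
      while((c<<=1)&128) ++bytes;   // count the continuation bytes announced by the lead byte
      int result = 0;
      // each continuation byte contributes its low 6 bits, last byte lowest
      for(int i=bytes; i>0; --i) result|= (*(str+i)&127)<<(6*(bytes-i));
      int mask = 1;
      for(int i=bytes; i<6; ++i) mask<<= 1, mask|= 1;  // mask for the payload bits of the lead byte
      result|= (*str&mask)<<(6*bytes);
      return result;
    }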
    

    Then you can try a Windows-specific WinAPI function like MultiByteToWideChar (don't forget to call setlocale() before printing the converted string!),

    or you can use your own mapping from the Unicode table to your active code page. Example:

    int main() {
      system("chcp 65001 > nul");
      char str[] = "příšerně"; // file saved in UTF-8
      for(const char* p=str; *p!=0; p=ucshift(p)) {
        int c = ucchar(p);
        if(c<128) printf("%c\n",c);
        else printf("%d\n",c);
      }
    }
    

    This should print

    p
    345
    237
    353
    e
    r
    n
    283
    

    If your code page doesn't support these Czech diacritics, you could map 345=>r, 237=>i, 353=>s, 283=>e. There are at least 5(!) different character sets just for Czech. Displaying readable characters across different Windows locales is a horror.
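
    The fallback mapping mentioned here could be sketched as a small lookup from a Unicode code point to an ASCII replacement. Only the four mappings come from the text above; the function name to_ascii_fallback and the '?' placeholder for unmapped characters are assumptions made for illustration.

```cpp
// Hypothetical fallback: returns a plain ASCII character for a Unicode
// code point that the active code page may not be able to display.
char to_ascii_fallback(int codepoint) {
    // ASCII code points are passed through unchanged.
    if (codepoint >= 0 && codepoint < 128) return static_cast<char>(codepoint);
    switch (codepoint) {
        case 345: return 'r';  // U+0159, r with caron
        case 237: return 'i';  // U+00ED, i with acute
        case 353: return 's';  // U+0161, s with caron
        case 283: return 'e';  // U+011B, e with caron
        default:  return '?';  // assumption: placeholder for anything unmapped
    }
}
```

    Fed with the code points listed in the output above, this yields the letters p, r, i, s, e, r, n, e. A real table would need one entry per character of every language you intend to support.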
