Difference between MBCS and UTF-8 on Windows

醉话见心 2020-11-28 03:23

I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What i

4 answers
  •  半阙折子戏
    2020-11-28 04:08

    I noticed that there are two compiler flags in Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them?

    Many functions in the Windows API come in two versions: one that takes char parameters (in a locale-specific code page) and one that takes wchar_t parameters (in UTF-16).

    int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
    int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);
    

    Each of these function pairs also has a macro without the suffix, which expands to one version or the other depending on whether the UNICODE macro is defined.

    #ifdef UNICODE
       #define MessageBox MessageBoxW
    #else
       #define MessageBox MessageBoxA
    #endif
    

    In order to make this work, the TCHAR type is defined to abstract away the character type used by the API functions.

    #ifdef UNICODE
        typedef wchar_t TCHAR;
    #else
        typedef char TCHAR;
    #endif
    

    This, however, was a bad idea. You should always explicitly specify the character type.

    What I am not getting is how UTF-8 is conceptually different from an MBCS encoding?

    MBCS stands for "multi-byte character set". For the literal minded, it seems that UTF-8 would qualify.

    But in Windows, "MBCS" only refers to character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), but NOT UTF-8.
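To make the "literal" sense of multi-byte concrete: in UTF-8, one character can occupy anywhere from one to four bytes, so byte count and character count diverge. The sketch below is plain standard C++, not a Windows API; `count_code_points` is a hypothetical helper name, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of any Windows or standard API).
// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumption: the input is well-formed UTF-8; nothing is validated.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {  // lead byte or ASCII byte
            ++count;
        }
    }
    return count;
}
```

For the string "a", "é", "€" encoded as UTF-8, `size()` reports six bytes while this helper reports three code points, which is the same kind of mismatch you get with Shift_JIS or GBK.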

    To use UTF-8, you have to convert the string to UTF-16 using MultiByteToWideChar, call the "W" version of the function, and call WideCharToMultiByte on the output. This is essentially what the "A" functions actually do, which makes me wonder why Windows doesn't just support UTF-8.
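As a rough illustration of the first conversion step, here is a portable sketch of what turning UTF-8 into UTF-16 involves. On Windows you would normally call MultiByteToWideChar with CP_UTF8 rather than hand-rolling this; the function name `utf8_to_utf16` is made up for the example, and the code assumes well-formed input and does no error handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only.
// Decodes UTF-8 bytes into code points and re-encodes them as UTF-16.
// Assumption: the input is well-formed UTF-8; invalid sequences are not detected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow

        if (lead < 0x80)      { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
        ++i;

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // one UTF-16 code unit
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The result is a sequence of UTF-16 code units, which is the form the "W" functions expect (wchar_t is 16 bits wide on Windows). Going back the other way is the job of WideCharToMultiByte.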

    This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.

    Unicode is a 16-bit character encoding

    This negates whatever I read about the Unicode.

    MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)

    Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone, and UTF-8 was only used on Plan 9. So UCS-2 was Unicode.
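To see why 16 bits turned out not to be enough, it helps to count how many code units a given code point needs in each encoding. The helpers below are a small sketch in standard C++; the names `utf8_bytes`, `utf16_units` and `bits_needed` are invented for this example, and each assumes its argument is a valid code point (at most 0x10FFFF, not a surrogate).

```cpp
#include <cstddef>

// Hypothetical helpers, for illustration only.
// Assumption: cp is a valid Unicode code point (<= 0x10FFFF, not a surrogate).

// Bytes needed to encode cp in UTF-8.
std::size_t utf8_bytes(char32_t cp) {
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// 16-bit code units needed to encode cp in UTF-16.
// Anything above 0xFFFF does not fit in UCS-2 and needs a surrogate pair.
std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// Number of significant bits in cp.
std::size_t bits_needed(char32_t cp) {
    std::size_t bits = 0;
    while (cp != 0) {
        ++bits;
        cp >>= 1;
    }
    return bits;
}
```

With these, U+20AC (the euro sign) takes three UTF-8 bytes and one UTF-16 unit, while U+1F600 takes four UTF-8 bytes and two UTF-16 units. The highest code point, U+10FFFF, needs 21 bits, which is where the "21-bit" figure comes from.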
