Difference between MBCS and UTF-8 on Windows

醉话见心 2020-11-28 03:23

I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What i

4 answers
  •  悲哀的现实
    2020-11-28 04:00

    _MBCS and _UNICODE are macros that determine which version of the TCHAR.H routines gets called. For example, if you use _tcsclen to count the length of a string, the preprocessor maps _tcsclen to a different function depending on which of the two macros, _MBCS or _UNICODE, is defined.

    _UNICODE & _MBCS Not Defined: strlen  
    _MBCS Defined: _mbslen  
    _UNICODE Defined: wcslen  
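
The mapping above can be imitated in portable C. This is only a sketch of the idea, not the real TCHAR.H: every name starting with `demo_` or `DEMO_` is made up for illustration, and `demo_mbslen` is a simplified stand-in for `_mbslen` that assumes GBK-style lead bytes in the range 0x81..0xFE.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for _mbslen: assumes a double-byte code page
   whose lead bytes are 0x81..0xFE (as in GBK). */
size_t demo_mbslen(const unsigned char *s) {
    size_t n = 0;
    while (*s != 0) {
        /* A lead byte followed by a non-zero byte forms one character. */
        s += (*s >= 0x81 && *s <= 0xFE && s[1] != 0) ? 2 : 1;
        n++;
    }
    return n;
}

/* Hypothetical macros that mimic how one generic name is mapped
   to one of three functions at preprocessing time. */
#if defined(DEMO_UNICODE)
typedef wchar_t demo_tchar;
#define DEMO_T(x) L##x
#define demo_tcsclen(s) wcslen(s)
#elif defined(DEMO_MBCS)
typedef char demo_tchar;
#define DEMO_T(x) x
#define demo_tcsclen(s) demo_mbslen((const unsigned char *)(s))
#else
typedef char demo_tchar;
#define DEMO_T(x) x
#define demo_tcsclen(s) strlen(s)
#endif

/* The same source line compiles to a different call per configuration. */
size_t demo_length(const demo_tchar *s) {
    return demo_tcsclen(s);
}
```

Compiling with `-DDEMO_UNICODE`, `-DDEMO_MBCS`, or neither switches which function `demo_length(DEMO_T("abc"))` ends up calling, without touching the calling code.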
    

    To explain the difference between these string-length functions, consider the following example.
    Suppose you have a computer running the Simplified Chinese edition of Windows, which uses GBK (code page 936), and you compile a GBK-encoded source file and run it.

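    /* All three functions return size_t, so print with %zu. */
    printf("%zu\n", _mbslen((const unsigned char*)"I爱你M"));
    printf("%zu\n", strlen("I爱你M"));
    /* Deliberately wrong cast: reinterprets the GBK bytes as wchar_t. */
    printf("%zu\n", wcslen((const wchar_t*)"I爱你M"));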
    

    The result would be 4 6 3.

    Here is the hexadecimal representation of I爱你M in GBK.

    GBK:             49 B0 AE C4 E3 4D 00                
    

    _mbslen knows this string is encoded in GBK, so it can interpret the string correctly and get the right result of 4 characters: 49 as I, B0 AE as 爱, C4 E3 as 你, 4D as M.

    strlen only looks for the 0x00 terminator and counts the bytes before it, so it gets 6.

    wcslen assumes this byte array is encoded in UTF-16LE and counts every two bytes as one character, so it gets 3: 49 B0, AE C4, E3 4D.

    As @xiaokaoy pointed out, the only valid terminator for wcslen is 00 00. Thus the result is not guaranteed to be 3 if the byte that follows the string's terminating 00 is not also 00.
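
To make the three counts reproducible outside Windows, here is a rough sketch that walks the same bytes by hand. The function names and the array `kGbkBytes` are hypothetical; the GBK lead-byte range 0x81..0xFE is an assumption about code page 936, and the array carries one extra 00 at the end so that the 16-bit scan has a well-defined 00 00 terminator (the original string literal only guarantees one).

```c
#include <stddef.h>

/* "I爱你M" in GBK, followed by 00 00 (the second 00 is added here
   on purpose so the 16-bit scan below stops at a known place). */
const unsigned char kGbkBytes[] = {
    0x49, 0xB0, 0xAE, 0xC4, 0xE3, 0x4D, 0x00, 0x00
};

/* strlen-style: count bytes before the first 0x00. */
size_t count_bytes(const unsigned char *s) {
    size_t n = 0;
    while (s[n] != 0x00) {
        n++;
    }
    return n;
}

/* _mbslen-style for GBK (simplified): a lead byte in 0x81..0xFE
   followed by a non-zero byte counts as one character. */
size_t count_gbk_chars(const unsigned char *s) {
    size_t n = 0;
    while (*s != 0x00) {
        if (*s >= 0x81 && *s <= 0xFE && s[1] != 0x00) {
            s += 2; /* double-byte character */
        } else {
            s += 1; /* single-byte character */
        }
        n++;
    }
    return n;
}

/* wcslen-style with 16-bit units: count byte pairs until a 00 00 pair. */
size_t count_utf16_units(const unsigned char *s) {
    size_t n = 0;
    while (s[2 * n] != 0x00 || s[2 * n + 1] != 0x00) {
        n++;
    }
    return n;
}
```

On `kGbkBytes` these return 6, 4 and 3 respectively, matching the explanation above. If the final byte of the array were anything other than 00, `count_utf16_units` would keep reading past the end, which is exactly the hazard with the real wcslen call.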
