Difference between MBCS and UTF-8 on Windows

醉话见心 2020-11-28 03:23

I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What i

4 answers

悲哀的现实 (OP)

2020-11-28 04:00
_MBCS and _UNICODE are macros that determine which version of the TCHAR.H routines gets called. For example, if you use _tcsclen to get the length of a string, the preprocessor maps _tcsclen to a different function depending on which of the two macros, _MBCS or _UNICODE, is defined.
```
_UNICODE & _MBCS Not Defined: strlen  
_MBCS Defined: _mbslen  
_UNICODE Defined: wcslen  
```
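The mapping above is plain conditional compilation. Here is a minimal sketch of the mechanism in portable C. The names MY_UNICODE, my_tchar, MY_T, my_tcslen and greeting_length are made up for illustration and are not the real TCHAR.H definitions. The _MBCS branch is left out because _mbslen is specific to the Microsoft CRT.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for what TCHAR.H does; none of these names
   come from the real header. */
#if defined(MY_UNICODE)
typedef wchar_t my_tchar;
#define MY_T(s) L##s          /* "hello" becomes L"hello" */
#define my_tcslen wcslen
#else
typedef char my_tchar;
#define MY_T(s) s             /* literal is left as a narrow string */
#define my_tcslen strlen
#endif

/* The same source line calls a different function after preprocessing. */
size_t greeting_length(void) {
    const my_tchar *greeting = MY_T("hello");
    return my_tcslen(greeting);
}
```

Compiling with -DMY_UNICODE (or /DMY_UNICODE) switches both the character type and the length function without touching greeting_length itself.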
To explain the difference between these string-length functions, consider the following example.
Suppose you have a computer running the Simplified Chinese edition of Windows, which uses GBK (code page 936), and you compile a source file saved in GBK encoding and run it.
```
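/* The functions return size_t, so print with %zu rather than %d. */
printf("%zu\n", _mbslen((const unsigned char*)"I爱你M"));  /* GBK-aware character count */
printf("%zu\n", strlen("I爱你M"));                         /* byte count */
/* Deliberately wrong cast: reinterprets the GBK bytes as 16-bit units. */
printf("%zu\n", wcslen((const wchar_t*)"I爱你M"));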
```
The results would be 4, 6 and 3 respectively.

Here is the hexadecimal representation of I爱你M in GBK.
```
GBK:             49 B0 AE C4 E3 4D 00                
```
_mbslen knows this string is encoded in GBK, so it can interpret the string correctly and get the right result of 4 characters: 49 as I, B0 AE as 爱, C4 E3 as 你, 4D as M.

strlen only looks for the terminating 0x00 and counts every byte before it, so it gets 6.

wcslen treats this byte array as if it were encoded in UTF-16LE and counts every two bytes as one character, so it gets 3: 49 B0, AE C4, E3 4D.

As @xiaokaoy pointed out, the only valid terminator for wcslen is 00 00. Thus the result is not guaranteed to be 3 if the byte following the string's 00 is not also 00.
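The three counting strategies can be imitated in portable C on the raw bytes, which avoids any dependence on the source file encoding or on the Microsoft CRT. This is only a sketch: count_bytes, count_gbk_chars and count_16bit_units are hypothetical helpers, the GBK lead-byte test is simplified (any byte in 0x81..0xFE starts a two-byte character), and a second 0x00 has been appended so that the 16-bit scan has a well-defined 00 00 terminator.

```c
#include <stddef.h>
#include <string.h>

/* "I爱你M" in GBK as raw bytes. The final extra 0x00 is an assumption
   added for this sketch; a real string literal only guarantees one. */
const unsigned char gbk_bytes[] = {
    0x49, 0xB0, 0xAE, 0xC4, 0xE3, 0x4D, 0x00, 0x00
};

/* Byte-oriented count: stops at the first 0x00, like strlen. */
size_t count_bytes(const unsigned char *s) {
    return strlen((const char *)s);
}

/* Simplified GBK-aware count: a lead byte in 0x81..0xFE plus the byte
   after it form one character; any other byte is one character. */
size_t count_gbk_chars(const unsigned char *s) {
    size_t n = 0;
    while (*s != 0x00) {
        if (*s >= 0x81 && *s <= 0xFE && s[1] != 0x00) {
            s += 2;
        } else {
            s += 1;
        }
        n++;
    }
    return n;
}

/* 16-bit count: consumes two bytes per unit until a 00 00 pair. */
size_t count_16bit_units(const unsigned char *s) {
    size_t n = 0;
    while (s[0] != 0x00 || s[1] != 0x00) {
        s += 2;
        n++;
    }
    return n;
}
```

On gbk_bytes these return 6, 4 and 3. If the last byte were anything other than 0x00, count_16bit_units would keep reading past the array, which is the problem @xiaokaoy describes.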