I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What i
_MBCS and _UNICODE are preprocessor macros that determine which version of the TCHAR.H routines gets called. For example, if you use _tcsclen to get the length of a string, the preprocessor maps _tcsclen to a different function depending on which of the two macros, _MBCS or _UNICODE, is defined:
_UNICODE & _MBCS Not Defined: strlen
_MBCS Defined: _mbslen
_UNICODE Defined: wcslen
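The mapping above can be sketched with ordinary conditional compilation. This is only an illustration of the idea, not the actual contents of TCHAR.H: the names my_tchar, MY_T and my_tcsclen are hypothetical stand-ins for TCHAR, _T and _tcsclen, and the _MBCS branch assumes a Microsoft toolchain that provides <mbstring.h>.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#if defined(_MBCS) && !defined(_UNICODE)
#include <mbstring.h>  // declares _mbslen (Microsoft CRT only)
#endif

#if defined(_UNICODE)
// Wide build: characters are wchar_t, literals get an L prefix.
using my_tchar = wchar_t;
#define MY_T(s) L##s
inline std::size_t my_tcsclen(const my_tchar* s) {
    return std::wcslen(s);
}
#elif defined(_MBCS)
// Multibyte build: characters are char, length is counted per character.
using my_tchar = char;
#define MY_T(s) s
inline std::size_t my_tcsclen(const my_tchar* s) {
    return _mbslen(reinterpret_cast<const unsigned char*>(s));
}
#else
// Neither macro defined: plain single-byte strings.
using my_tchar = char;
#define MY_T(s) s
inline std::size_t my_tcsclen(const my_tchar* s) {
    return std::strlen(s);
}
#endif
```

With this in place, a call such as my_tcsclen(MY_T("abc")) compiles unchanged in all three configurations, and only the function it resolves to differs.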
To see how these string-length functions differ, consider the following example.
Suppose you have a machine running the Simplified Chinese edition of Windows, which uses GBK (code page 936), and you compile a source file saved in GBK encoding and run it.
printf("%zu\n", _mbslen((const unsigned char*)"I爱你M"));
printf("%zu\n", strlen("I爱你M"));
printf("%zu\n", wcslen((const wchar_t*)"I爱你M"));
The result would be 4 6 3.
Here is the hexadecimal representation of I爱你M in GBK.
GBK: 49 B0 AE C4 E3 4D 00
_mbslen knows (from the current multibyte code page) that this string is encoded in GBK, so it interprets the bytes correctly and gets the right result of 4 characters: 49 as I, B0 AE as 爱, C4 E3 as 你, 4D as M.
strlen only looks for the terminating 0x00 byte and counts every byte before it, so it returns 6.
wcslen treats this byte array as UTF-16LE and counts every two bytes as one character, so it gets 3: 49 B0, AE C4, E3 4D.
As @xiaokaoy pointed out, the only valid terminator for wcslen is 00 00. The result is therefore not guaranteed to be 3 unless the byte that follows the string's single 00 also happens to be 00.
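To reproduce the three counts without depending on the system code page or the source file encoding, the byte dump can be walked by hand. The following is a rough sketch, not the CRT implementation: the helper names are made up for illustration, the GBK rule is simplified to "a byte in 0x81..0xFE starts a two-byte character", and an extra 0x00 is appended to the array so that the 16-bit scan is guaranteed to meet a 00 00 terminator.

```cpp
#include <cstddef>
#include <cstring>

// GBK bytes of "I爱你M" from the dump above, plus one extra 0x00
// (an assumption of this sketch) so a 00 00 terminator exists.
static const unsigned char kGbkBytes[] = {
    0x49, 0xB0, 0xAE, 0xC4, 0xE3, 0x4D, 0x00, 0x00
};

// What strlen does: count bytes up to the first 0x00.
std::size_t count_bytes(const unsigned char* s) {
    return std::strlen(reinterpret_cast<const char*>(s));
}

// Roughly what _mbslen does under GBK: a lead byte in 0x81..0xFE
// consumes the following byte too, and the pair counts as one character.
std::size_t count_gbk_chars(const unsigned char* s) {
    std::size_t n = 0;
    while (*s != 0) {
        const bool lead = (*s >= 0x81 && *s <= 0xFE);
        s += (lead && s[1] != 0) ? 2 : 1;
        ++n;
    }
    return n;
}

// What wcslen effectively does when wchar_t is 16 bits wide:
// step through byte pairs until both bytes of a pair are 0x00.
std::size_t count_16bit_units(const unsigned char* s) {
    std::size_t n = 0;
    while (s[0] != 0 || s[1] != 0) {
        s += 2;
        ++n;
    }
    return n;
}
```

Called on kGbkBytes, count_gbk_chars, count_bytes and count_16bit_units return 4, 6 and 3 respectively. Without the extra trailing 0x00, the last function would read past the end of the array, which is the same problem described for wcslen.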