Does the multibyte-to-wide-string conversion function “mbstowcs”, when passed a string literal, use the encoding of the source file?

前端 未结 2 1688
轮回少年
轮回少年 2020-12-17 03:13

ADDENDUM A tentative answer of my own appears at the bottom of the question.


I am converting an archaic VC6 C++/MFC project to VS2013 and Unic

相关标签:
2条回答
  • mbstowcs() semantics are defined in terms of the currently installed C locale. If you are processing string with different encodings you will need to use setlocale() to change what encoding is currently being used. The relevant statement in the C standard is in 7.22.8 paragraph 1:

    The behavior of the multibyte string functions is affected by the LC_CTYPE category of the current locale.

    I don't know enough about the C library but as far as I know none of these functions is really thread-safe. I consider it much easier to deal with different encodings and, in general, cultural conventions, using the C++ std::locale facilities. With respect to encoding conversions you'd look at the std::codecvt<...> facets. Admittedly, these aren't easy to use, though.

    The current locale needs a bit of clarification: the program has a current global locale. Initially, this locale is somehow set up by the system and is possibly controlled by the user's environment in some form. For example, on UNIX system there are environment variables which choose the initial locale. Once the program is running, it can change the current locale, however. How that is done depends a bit on what is being used exactly: a running C++ program actually has two locales: one used by the C library and one used by the C++ library.

    The C locale is used for all locale dependent function from the C library, e.g., mbstowcs() but also for tolower() and printf(). The C++ locale is used for all locale dependent function which are specific to the C++ library. Since C++ uses locale objects the global locale is just used as the default for entities not setting a locale specifically, and primarily for the stream (you'd set a stream's locale using s.imbue(loc)). Depending on which locale you set, there are different methods to set the global locale:

    1. For the C locale you use setlocale().
    2. For the C++ locale you use std::locale::global().
    0 讨论(0)
  • 2020-12-17 03:38

    The encoding of the source code file doesn't affect the behavior of mbstowcs. After all, the internal implementation of the function is unaware of what source code might be calling it.

    On the MSDN documentation you linked is:

    mbstowcs uses the current locale for any locale-dependent behavior; _mbstowcs_l is identical except that it uses the locale passed in instead. For more information, see Locale.

    That linked page about locales then references setlocale which is how the behavior of mbstowcs can be affected.

    Now, taking a look at your proposed way of passing UTF-8:

    mbstowcs (dest, u8"Hello, world!", 1024);
    

    Unfortunately, that isn't going to work properly as far as I know once you use interesting data. If it even compiles, it only does do because the compiler would have to be treating u8 the same as a char*. And as far as mbstowcs is concerned, it will believe the string is encoded under whatever the locale is set for.

    Even more unfortunately, I don't believe there's any way (on the Windows / Visual Studio platform) to set a locale such that UTF-8 would be used.

    So that would happen to work for ASCII characters (the first 128 characters) only because they happen to have the exact same binary values in various ANSI encodings as well as UTF-8. If you try with any characters beyond that (for instance anything with an accent or umlaut) then you'll see problems.


    Personally, I think mbstowcs and such are rather limited and clunky. I've found the Window's API function MultiByteToWideChar to be more effective in general. In particular it can easily handle UTF-8 just by passing CP_UTF8 for the code page parameter.

    0 讨论(0)
提交回复
热议问题