Char * encoding

执念已碎 2020-12-30 06:14

If I write the statement below in C++ under Visual Studio, what encoding will be used here?

const char *c = "£";

Under the Visual Studio proje

2 answers
  • 2020-12-30 06:27

    Setting the character set to 'Not Set' simply means that neither of the preprocessor macros _UNICODE and _MBCS will be defined. It has no effect on which character sets the compiler uses.

    The two settings that determine how the bytes of your source are converted to a string literal in the program are the 'source character set' and the 'execution character set'. The compiler will convert string literals from the source encoding to the execution encoding.

    Source encoding:

    The source encoding is the encoding used by the compiler to interpret the source file's bytes. It applies not just to string and character literals, but also to everything else in source including, for example, identifiers.

    If Visual Studio's compiler detects a Unicode 'signature' in a source file then it will use the corresponding Unicode encoding as the source encoding. Otherwise it will use the system's codepage encoding as the source encoding.
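
    As a rough illustration of what detecting a 'signature' means, the sketch below checks the leading bytes of a buffer for a byte order mark. The names Signature and detect_signature are made up for this example, and this is not the compiler's actual logic, only the well-known signature byte patterns for UTF-8 and UTF-16.

    ```cpp
    #include <cassert>
    #include <cstddef>

    // Hypothetical names, for illustration only.
    enum class Signature { None, Utf8, Utf16LE, Utf16BE };

    // Inspects the leading bytes of a file for a Unicode signature (BOM).
    // UTF-32 signatures are left out to keep the sketch short.
    Signature detect_signature(const unsigned char *data, std::size_t size) {
        assert(data != nullptr || size == 0);
        if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Signature::Utf8;     // EF BB BF
        if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Signature::Utf16LE;  // FF FE
        if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Signature::Utf16BE;  // FE FF
        return Signature::None;         // no signature found
    }
    ```

    A result of Signature::None corresponds to the case where the system's codepage would be assumed.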

    Execution encoding:

    The execution encoding is the encoding in which the compiler stores string and character literals, so the string and character data created by literals will be encoded using the execution encoding.

    Visual Studio's compiler uses the system's codepage as the execution encoding.


    When Visual Studio converts string and character literal data from the source encoding to the execution encoding, it replaces characters that cannot be represented in the execution encoding with '?'.

    So for your example:

    const char *c = "£";
    

    Assuming that your source is saved using Microsoft's "UTF-8 with signature" format and your system uses CP1252 as most systems in the West do, the string literal will be converted to:

    0xA3 0x00
    

    On the other hand, if the execution charset is something that doesn't include '£', such as CP1251 (Cyrillic, used in Windows' Russian locale), then the string literal will end up as:

    0x3F 0x00
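
    To see which of these byte sequences your own build actually produces, you can dump the bytes of the literal. The following is a minimal sketch; dump_bytes is a hypothetical helper name, not part of any library.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <string>

    // Hypothetical helper: formats the bytes of a NUL-terminated string,
    // terminator included, in the same "0xA3 0x00" style used above.
    std::string dump_bytes(const char *s) {
        assert(s != nullptr);
        std::string out;
        const std::size_t n = std::strlen(s) + 1;  // +1 keeps the trailing NUL
        for (std::size_t i = 0; i < n; ++i) {
            char buf[8];
            // Go through unsigned char so bytes >= 0x80 are not sign-extended.
            std::snprintf(buf, sizeof buf, "0x%02X",
                          static_cast<unsigned>(static_cast<unsigned char>(s[i])));
            if (i != 0) out += ' ';
            out += buf;
        }
        return out;
    }
    ```

    Printing dump_bytes("£") from your program then shows what the compiler stored for the literal under your source and execution encodings.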
    

    If you want to avoid depending on the source code encoding you can use Universal Character Names (UCNs):

    const char *c = "\u00A3"; // "£"
    

    If you want to guarantee a UTF-8 representation you'll also need to avoid dependence on the execution encoding. You can do that by manually encoding it:

    const char *c = "\xC2\xA3"; // UTF-8 encoding of "£"
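
    The bytes C2 A3 follow from the UTF-8 bit layout for U+00A3. As a sketch of that arithmetic (encode_utf8 is a hypothetical helper; it does not reject surrogate code points):

    ```cpp
    #include <cassert>
    #include <string>

    // Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
    std::string encode_utf8(char32_t cp) {
        assert(cp <= 0x10FFFF);  // largest valid code point
        std::string out;
        if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }
    ```

    For U+00A3 the two-byte branch applies: 0xA3 >> 6 is 2, giving 0xC2, and 0xA3 & 0x3F is 0x23, giving 0xA3.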
    

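    C++11 introduces UTF-8 string literals, which are the better option when your compiler supports them (note that from C++20 on these literals have a char8_t element type, so they no longer convert to const char * implicitly):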

    const char *c = u8"£";
    

    or

    const char *c = u8"\u00A3"; // "£"
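
    A small check, assuming a compiler with u8 literal support, that a u8 literal written with a UCN holds the UTF-8 bytes regardless of the source and execution encodings. The function name is made up for this sketch, and the cast is only there so the same code compiles both before and after C++20.

    ```cpp
    #include <cassert>

    // Hypothetical check: true if u8"\u00A3" is stored as C2 A3 00.
    bool u8_pound_is_c2_a3() {
        // const char[3] before C++20, const char8_t[3] from C++20 on.
        const auto &lit = u8"\u00A3";
        assert(sizeof(lit) == 3);  // two UTF-8 bytes plus the terminating NUL
        const unsigned char *p = reinterpret_cast<const unsigned char *>(lit);
        return p[0] == 0xC2 && p[1] == 0xA3 && p[2] == 0x00;
    }
    ```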
    
  • 2020-12-30 06:45

    Since VS2015 Update 2 there are new compiler options to control this. Here is a relevant quote:

    "There is also a /utf-8 option that is a synonym for setting “/source-charset:utf-8” and “/execution-charset:utf-8”."

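    If a project relies on the /utf-8 option quoted above, one way to catch a build that forgot it is to test how a narrow literal was encoded. This is only a sketch: is_utf8_pound and the other names are hypothetical, and it checks a single character rather than the execution charset as a whole.

    ```cpp
    #include <cassert>
    #include <cstddef>

    // True if the bytes are exactly C2 A3 00, the UTF-8 form of "£".
    constexpr bool bytes_are_utf8_pound(const char *s, std::size_t size_with_nul) {
        return size_with_nul == 3 &&
               static_cast<unsigned char>(s[0]) == 0xC2 &&
               static_cast<unsigned char>(s[1]) == 0xA3 &&
               s[2] == '\0';
    }

    // Convenience overload that takes a string literal directly.
    template <std::size_t N>
    constexpr bool is_utf8_pound(const char (&lit)[N]) {
        return bytes_are_utf8_pound(lit, N);
    }

    // Runtime variant: fails in debug builds if narrow literals are not UTF-8.
    inline void check_execution_charset() {
        assert(is_utf8_pound("\u00A3") && "narrow literals are not UTF-8 encoded");
    }
    ```

    The same condition can be used at compile time, for example static_assert(is_utf8_pound("\u00A3"), "expected a UTF-8 execution charset");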