Raw string literals and file codification

Question

C++11 introduced raw string literals, which are useful for representing quoted strings and literals with many special symbols, such as Windows file paths and regular expressions:

std::string path = R"(C:\teamwork\new_project\project1)"; // no tab nor newline!
std::string quoted = R"("quoted string")";
std::string expression = R"([\w]+[ ]+)";

Raw string literals can also be combined with encoding prefixes (u8, u, U, or L). But when no encoding prefix is specified, does the file encoding matter? Suppose I have this code:

auto message = R"(Pick up a card)";         // raw string 1
auto cards = R"(🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪🂫🂬🂭🂮)"; // raw string 2

If I can write and store the code above, it's obvious that my source file is encoded as Unicode, so I'm wondering:

Would raw string 1 be a Unicode literal, even though it only uses ASCII characters? In other words, does a raw string inherit the encoding of the file in which it is written, or does the compiler detect that Unicode isn't needed regardless of the file encoding?
Would the encoding prefix U be necessary on raw string 2 in order to treat it as a Unicode literal, or would it be Unicode automatically because of its contents and/or the source file encoding?

Thanks for your attention.

EDIT:

Testing the code above on ideone.com and printing the demangled types of the message and cards variables outputs char const*:

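#include <cstdlib>    // std::free
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang only)
#include <iostream>
#include <string>
#include <typeinfo>

// Returns the readable name of T; the argument is used only for type deduction.
template<typename T> std::string demangle(T)
{
    int status = 0;
    char *const name = abi::__cxa_demangle(typeid(T).name(), 0, 0, &status);
    // On failure __cxa_demangle returns a null pointer, so fall back to the mangled name.
    std::string result(status == 0 && name ? name : typeid(T).name());
    std::free(name);
    return result;
}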

int main()
{
    auto message = R"(Pick up a card)";
    auto cards = R"(🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪🂫🂬🂭🂮)";

    std::cout
        << "message type: " << demangle(message) << '\n'
        << "cards type: " << demangle(cards) << '\n';

    return 0;
}

Output:

message type: char const*
cards type: char const*

This is even weirder than I expected: I was convinced that the type would be wchar_t (even without the L prefix).

Answer 1:

Yes, it matters, even just to compile your source. With GCC you will need something like -finput-charset=UTF-16 if the file is saved as UTF-16 (something similar should apply to VS).

But, IMHO, there is something more fundamental to take into account in your code. For example, std::string is a container of char, which is 1 byte wide. If you are dealing with UTF-16, for instance, each code unit needs 2 bytes, so (short of converting by hand) you will need at least wchar_t (std::wstring) or, to be safer in C++11, char16_t (std::u16string), since the width of wchar_t varies between platforms.

So, to use Unicode you will need a container for it and a compilation environment prepared to handle your Unicode-encoded sources.
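To make the point about code-unit width concrete, here is a minimal sketch (not part of the original answer). It counts how many code units one of the card characters from the question, U+1F0A1, occupies under each Unicode prefix. The helper name code_units is made up for illustration, and the character is written as a universal character name in ordinary (non-raw) literals so that the result does not depend on how the source file is saved.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+1F0A1 (PLAYING CARD ACE OF SPADES), spelled as a universal character
// name so the literal is independent of the source file encoding.
static_assert(code_units(u8"\U0001F0A1") == 4, "UTF-8: four 1-byte code units");
static_assert(code_units(u"\U0001F0A1") == 2, "UTF-16: a surrogate pair");
static_assert(code_units(U"\U0001F0A1") == 1, "UTF-32: one code unit");
```

The file compiles as-is with a C++11 compiler; the static_asserts are the whole check. Note that the UTF-8 form fits in plain char units, so the container you need depends on which encoding you choose, not on "Unicode" as such.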

Answer 2:

Raw string literals change how escapes are dealt with but do not change how encodings are handled. Raw string literals still convert their contents from the source encoding to produce a string in the appropriate execution encoding.

The type of a string literal and the appropriate execution encoding are determined entirely by the prefix. R alone always produces a char string in the narrow execution encoding. If the source is UTF-16 (and the compiler supports UTF-16 as the source encoding) then the compiler will convert the string literal contents from UTF-16 to the narrow execution encoding.
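As a rough illustration of "the prefix decides the type", the sketch below (my own, not from the original answer) lets auto deduce the pointer type for each prefix and checks it at compile time. The variable names are arbitrary, and the contents are kept to ASCII so the source encoding plays no part. The u8 prefix is left out because its character type differs between standards (char up to C++17, char8_t from C++20).

```cpp
#include <type_traits>

// Same raw contents, different prefixes: only the prefix changes the type.
auto narrow = R"(Pick up a card)";   // no prefix: narrow execution encoding
auto wide   = LR"(Pick up a card)";  // L: wide execution encoding
auto utf16  = uR"(Pick up a card)";  // u: UTF-16
auto utf32  = UR"(Pick up a card)";  // U: UTF-32

static_assert(std::is_same<decltype(narrow), const char*>::value, "R alone is char");
static_assert(std::is_same<decltype(wide), const wchar_t*>::value, "LR is wchar_t");
static_assert(std::is_same<decltype(utf16), const char16_t*>::value, "uR is char16_t");
static_assert(std::is_same<decltype(utf32), const char32_t*>::value, "UR is char32_t");
```

This matches the char const* output reported in the question: without a prefix, neither the contents of the literal nor the file encoding changes the deduced type.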

Source: https://stackoverflow.com/questions/21460700/raw-string-literals-and-file-codification

Tags

c++

c++11

string-literals

rawstring