I am having trouble understanding the semantics of u8-literals, or rather, understanding the results I get with g++ 4.8.1.
This is my expectation:
const std::string
To illustrate the discussion, here are some examples. Consider the following code:
#include <iostream>

int main() {
    std::cout << "åäö\n";
}
1) Compiling this with g++ -std=c++11 encoding.cpp will produce an executable that yields:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
In other words, two bytes per "grapheme cluster" (in Unicode jargon; here that simply means per character), plus the final newline (0a). This is because my file is encoded in utf-8, cpp assumes the input charset is utf-8, and gcc's exec charset defaults to utf-8 (see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html). Good.
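To double-check without piping through od, here is a minimal snippet (just a sketch, and it assumes the compiler decoded my source characters as intended) that dumps the bytes of the literal from inside the program; what comes out is whatever the compiler stored under the exec charset in effect:

#include <cstdio>

int main() {
    // Print each byte of the literal in hex, mirroring `od -txC`.
    // The values depend on the execution charset chosen at compile time.
    const char s[] = "åäö\n";
    for (unsigned char c : s) {
        if (c != '\0')
            std::printf("%02x ", c);
    }
    std::printf("\n");
}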
2) Now if I convert my file to iso-8859-1 and compile again using the same command, I get:
% ./a.out | od -txC
0000000 e5 e4 f6 0a
i.e. the three characters are now encoded using iso-8859-1. I am not sure what magic is going on here: this time it seems that cpp correctly guessed that the file was iso-8859-1 (without any hint) and converted it to utf-8 internally (according to the link above), yet the compiler still stored the iso-8859-1 string in the binary. We can check this by looking at the .rodata section of the binary:
% objdump -s -j .rodata a.out
a.out: file format elf64-x86-64
Contents of section .rodata:
400870 01000200 00e5e4f6 0a00 ..........
(Note the "e5e4f6" sequence of bytes).
This makes perfect sense, as a programmer who uses latin-1 literals does not expect them to come out as utf-8 strings in the program's output.
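Another way to see, without objdump, how many bytes the compiler actually stored is sizeof on the literal (again just a sketch, assuming the source characters were decoded as intended):

#include <iostream>

int main() {
    // sizeof counts the bytes the compiler stored, including the trailing '\0':
    // 5 with an iso-8859-1 exec charset (3 one-byte characters + '\n' + '\0'),
    // 8 with utf-8 (3 two-byte characters + '\n' + '\0').
    std::cout << sizeof("åäö\n") << "\n";
}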
3) Now if I keep the same iso-8859-1-encoded file, but compile with g++ -std=c++11 -finput-charset=iso-8859-1 encoding.cpp, then I get a binary that outputs utf-8 data:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
I find this weird: the source encoding has not changed, I explicitly tell gcc it is latin-1, and I get utf-8 as a result! Note that this can be overridden if I explicitly request the exec charset with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp:
% ./a.out | od -txC
0000000 e5 e4 f6 0a
It is not clear to me how these two options interact...
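For reference, here is a small probe (just a sketch, assuming the source decodes to the intended characters in the first place) that reports at run time which encoding an ordinary literal ended up in, based on the first stored byte of "å":

#include <iostream>

int main() {
    // Under a utf-8 exec charset the first byte of "å" is 0xc3,
    // under iso-8859-1 it is 0xe5.
    const unsigned char first = static_cast<unsigned char>("å"[0]);
    if (first == 0xc3)
        std::cout << "literal stored as utf-8\n";
    else if (first == 0xe5)
        std::cout << "literal stored as iso-8859-1\n";
    else
        std::cout << "literal stored as something else\n";
}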
4) Now let's add the "u8" prefix into the mix:
#include <iostream>

int main() {
    std::cout << u8"åäö\n";
}
If the file is utf-8-encoded and I compile with the default charsets (g++ -std=c++11 encoding.cpp), the output is unsurprisingly utf-8 as well. If I ask the compiler to use iso-8859-1 internally instead (g++ -std=c++11 -fexec-charset=iso-8859-1 encoding.cpp), the output is still utf-8:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
So it looks like the prefix "u8" prevented the compiler from converting the literal to the execution character set. Even better, if I convert the same source file to iso-8859-1 and compile with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp, then I still get utf-8 output:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
So it seems that"u8" actually acts as an "operator" that tells the compiler "convert this literal to utf-8".