How are u8-literals supposed to work?

渐次进展 2020-12-15 06:32

I'm having trouble understanding the semantics of u8-literals, or rather, understanding the result on g++ 4.8.1.

This is my expectation:

    const std::string
2 Answers
  •  余生分开走
    2020-12-15 07:06

    In order to illustrate this discussion, here are some examples. Let's consider the code:

    #include <iostream>

    int main() {
      std::cout << "åäö\n";
    }
    

    1) Compiling this with g++ -std=c++11 encoding.cpp will produce an executable that yields:

    % ./a.out | od -txC
    0000000 c3 a5 c3 a4 c3 b6 0a
    

    In other words, two bytes per "grapheme cluster" (Unicode jargon; in this case, per character), plus the final newline (0a). This is because my file is encoded in utf-8, cpp assumes the input charset is utf-8, and the exec charset defaults to utf-8 in gcc (see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html). Good.
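
    To see the same bytes without piping through od, here is a minimal runtime sketch (assuming the same utf-8 source and default charsets as above; the array name s is just illustrative):

    #include <cstdio>

    int main() {
      const char s[] = "åäö\n";
      // Expected with utf-8 input and exec charsets: c3 a5 c3 a4 c3 b6 0a
      for (const char* p = s; *p != '\0'; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
      std::printf("\n");
    }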

    2) Now if I convert my file to iso-8859-1 and compile again using the same command, I get:

    % ./a.out | od -txC
    0000000 e5 e4 f6 0a
    

    i.e. the three characters are now encoded using iso-8859-1. I am not sure about the magic going on here: either cpp correctly guessed (without any hint) that the file was iso-8859-1 and converted it to utf-8 internally (as described in the link above), or it simply passed the bytes of the literal through unchanged. Either way, the compiler stored the iso-8859-1 string in the binary, which we can check by looking at the .rodata section of the binary:

    % objdump -s -j .rodata a.out
    
    a.out:     file format elf64-x86-64
    
    Contents of section .rodata:
    400870 01000200 00e5e4f6 0a00               ..........
    

    (Note the "e5e4f6" sequence of bytes.)
    This makes perfect sense: a programmer who uses latin-1 literals does not expect them to come out as utf-8 strings in the program's output.
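
    For reference, step 2 can be reproduced explicitly with iconv (a sketch; the intermediate filename encoding-latin1.cpp is just illustrative):

    % iconv -f utf-8 -t iso-8859-1 encoding.cpp > encoding-latin1.cpp
    % g++ -std=c++11 encoding-latin1.cpp
    % ./a.out | od -txC
    0000000 e5 e4 f6 0a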

    3) Now if I keep the same iso-8859-1-encoded file, but compile with g++ -std=c++11 -finput-charset=iso-8859-1 encoding.cpp, then I get a binary that outputs utf-8 data:

    % ./a.out | od -txC
    0000000 c3 a5 c3 a4 c3 b6 0a
    

    I find this weird: the source encoding has not changed, I explicitly tell gcc it is latin-1, and I get utf-8 as a result! Note that this can be overridden if I explicitly request the exec charset with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp:

    % ./a.out | od -txC
    0000000 e5 e4 f6 0a
    

    It is not clear to me how these two options interact...
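
    Based on the GCC documentation linked above, one consistent reading (a hedged summary, not an authoritative spec) is that the literal goes through two independent stages, each controlled by its own flag:

    # stage 1: bytes are converted from -finput-charset (default: the
    #          locale charset, typically utf-8) to utf-8 internally
    # stage 2: the internal utf-8 form is converted to -fexec-charset
    #          (default: utf-8) and stored in the binary

    # step 3: latin-1 in, default utf-8 out -> the bytes get transcoded
    % g++ -std=c++11 -finput-charset=iso-8859-1 encoding.cpp
    # latin-1 in, latin-1 out -> a round-trip, the bytes come out unchanged
    % g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp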

    4) Now let's add the "u8" prefix into the mix:

    #include <iostream>

    int main() {
      std::cout << u8"åäö\n";
    }
    

    If the file is utf-8-encoded, compiling with the default charsets (g++ -std=c++11 encoding.cpp) unsurprisingly produces utf-8 output as well. If I instead request iso-8859-1 as the execution charset (g++ -std=c++11 -fexec-charset=iso-8859-1 encoding.cpp), the output is still utf-8:

    % ./a.out | od -txC
    0000000 c3 a5 c3 a4 c3 b6 0a
    

    So it looks like the prefix "u8" prevented the compiler from converting the literal to the execution character set. Even better, if I convert the same source file to iso-8859-1 and compile with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp, then I still get utf-8 output:

    % ./a.out | od -txC
    0000000 c3 a5 c3 a4 c3 b6 0a
    

    So it seems that "u8" actually acts as an "operator" that tells the compiler "convert this literal to utf-8".
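
    That matches the C++11 semantics, under which a u8-literal always consists of utf-8 code units regardless of the execution charset. A minimal sketch to check this at compile time (the array name s is just illustrative):

    #include <cstdio>

    int main() {
      const char s[] = u8"åäö\n";
      // 'å', 'ä' and 'ö' are two utf-8 bytes each, plus '\n' and the
      // terminating '\0': 3*2 + 1 + 1 = 8 bytes in total.
      static_assert(sizeof(s) == 8, "u8-literals are always utf-8");
      for (const char* p = s; *p != '\0'; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
      std::printf("\n");
    }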
