Why does stringizing a euro sign within a string literal using UTF-8 not produce a UCN?

Submitted by 余生长醉 on 2019-12-04 04:07:32

It's simply a bug. §2.1/1 says of translation phase 1:

(An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

This is not a note or footnote. C++0x adds an exception for raw string literals, which might solve your problem at hand if you have one.
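As a rough illustration of that raw string exception, the sketch below compares a non-raw literal containing \u20AC with a raw literal containing the same six source characters. The helper name literal_length is made up for this example, and the u8 prefix is used only so that the encoded length does not depend on the compiler's execution character set.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t literal_length(const CharT (&)[N]) { return N - 1; }

// In a non-raw literal, \u20AC is a universal-character-name and becomes
// one character; UTF-8 encodes U+20AC as three code units (E2 82 AC).
constexpr std::size_t ucn_length = literal_length(u8"\u20AC");

// In a raw literal the six source characters are kept as written:
// backslash, 'u', '2', '0', 'A', 'C'.
constexpr std::size_t raw_length = literal_length(R"(\u20AC)");

static_assert(ucn_length == 3, "U+20AC is three code units in UTF-8");
static_assert(raw_length == 6, "a raw literal keeps the escape verbatim");
```

Both checks are evaluated at compile time, so a conforming C++11 compiler should accept the file without running anything.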

This program clearly demonstrates the malfunction:

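#include <iostream>

// Stringize the argument, then paste L in front to form a wide string literal.
#define GET_UCN(X) L ## #X

int main() {
    // Both spellings should print the same text if they are handled equivalently.
    std::wcout << GET_UCN("€") << '\n' << GET_UCN("\u20AC") << '\n';
}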

http://ideone.com/lb9jc

Because both strings are wide, the first is required to be corrupted into several characters if the compiler fails to interpret the input multibyte sequence. In your given example, total lack of support for UTF-8 could cause the compiler to slavishly echo the sequence right through.
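To make the "corrupted into several characters" case concrete, here is a small sketch of the two behaviours. Neither function is what any particular compiler does internally; widen_bytewise and decode_three_byte_utf8 are hypothetical names that model a front end that ignores the multibyte sequence versus one that decodes it.

```cpp
#include <string>

// Models a compiler that does not interpret the multibyte sequence:
// every input byte becomes its own wide character.
std::wstring widen_bytewise(const std::string& bytes) {
    std::wstring out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<wchar_t>(b));
    }
    return out;
}

// Models correct decoding of a single three-byte UTF-8 sequence
// (1110xxxx 10xxxxxx 10xxxxxx). Assumes the input is exactly three
// well-formed bytes; no validation is attempted in this sketch.
char32_t decode_three_byte_utf8(const std::string& bytes) {
    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    const unsigned char b2 = static_cast<unsigned char>(bytes[2]);
    return (static_cast<char32_t>(b0 & 0x0F) << 12) |
           (static_cast<char32_t>(b1 & 0x3F) << 6) |
           static_cast<char32_t>(b2 & 0x3F);
}
```

Given the bytes E2 82 AC, the first function yields three wide characters while the second yields the single code point U+20AC.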

"and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set"

used to be

"or universal-character-name in character literals and string literals is converted to a member of the execution character set"

Maybe you need a future version of g++.

I'm not sure where you got that citation for translation phase 1—the C99 standard says this about translation phase 1 in §5.1.1.2/1:

Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

So in this case, the euro character € (represented as the multibyte sequence E2 82 AC in UTF-8) is mapped into the source character set, which here also happens to be UTF-8, so its representation remains the same. It doesn't get converted into a universal character name because, well, there's nothing in that text that says it should.

I suspect you'll find that the euro sign does not satisfy the condition "Any source file character not in the basic source character set", so the rest of the text you quote doesn't apply.

Open your test file with your favourite binary editor and check what value is used to represent the euro sign in GET_UCN("€").
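If no binary editor is at hand, a few lines of code can do the same inspection. This is only a sketch: to_hex is a hypothetical helper, and it expects the raw bytes to have been read from the file already (for example with std::ifstream opened in binary mode).

```cpp
#include <string>

// Format each byte of the input as two uppercase hex digits,
// separated by single spaces.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) {
            out.push_back(' ');
        }
        out.push_back(digits[b >> 4]);    // high nibble
        out.push_back(digits[b & 0x0F]);  // low nibble
    }
    return out;
}
```

A file saved as UTF-8 should show E2 82 AC for the euro sign. A single byte such as A4 (ISO-8859-15) or 80 (Windows-1252) would mean the file is in a different encoding.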
