Is '\u0B95' a multicharacter literal?

Posted by 你离开我真会死。 on 2019-12-30 08:12:01

Question


In a previous answer, I claimed that the following warning appears because '\u0B95' requires three bytes and is therefore a multicharacter literal:

warning: multi-character character constant [-Wmultichar]

But actually, I don't think I'm right and I don't think gcc is either. The standard states:

An ordinary character literal that contains more than one c-char is a multicharacter literal.

One production rule for c-char is a universal-character-name (i.e. \uXXXX or \UXXXXXXXX). Since \u0B95 is a single c-char, this is not a multicharacter literal. But now it gets messy. The standard also says:

An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set.

So my literal has type char and value of the character in the execution character set (or implementation-defined value if it does not exist in that set). char is only defined to be large enough to store any member of the basic character set (which is not actually defined by the standard, but I assume it means the basic execution character set):

Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set.

Therefore, since the execution character set is a superset of all the values a char can hold, my character may not fit in the char.
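To make the size concern concrete, here is a minimal sketch. The helper name `fits_in_char` is hypothetical, and it assumes the encoded value is compared against the range of `unsigned char`; with a signed plain `char` the usable range is smaller still.

```cpp
#include <climits>

// Hypothetical helper: can an unsigned encoding value be stored in a single
// char-sized object? UCHAR_MAX is 255 when CHAR_BIT is 8.
constexpr bool fits_in_char(unsigned long value) {
    return value <= UCHAR_MAX;
}

// 0x41 ('A' in ASCII) is small enough on every implementation,
// because CHAR_BIT is at least 8.
static_assert(fits_in_char(0x41), "0x41 fits in a char");
```

On an implementation with 8-bit chars, `fits_in_char(0x0B95)` is false, so a fixed-width encoding that used the code point as its value could not be stored in a `char`.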

So what value does my char have? This doesn't seem to be defined anywhere. The standard does say that for char16_t literals, if the value is not representable, the program is ill-formed. It says nothing about ordinary literals, though.
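For contrast, the prefixed forms are well specified. The sketch below checks the type and value of the `char16_t` and `char32_t` literals for the same character; the constant name `tamil_ka` is only for illustration (U+0B95 is TAMIL LETTER KA).

```cpp
#include <type_traits>

// A char16_t literal with one c-char has type char16_t and, when the
// character fits in one 16-bit code unit, the value of its code point.
static_assert(std::is_same<decltype(u'\u0B95'), char16_t>::value,
              "u'...' has type char16_t");
static_assert(u'\u0B95' == 0x0B95, "value is the code point");

// The same holds for char32_t literals.
static_assert(std::is_same<decltype(U'\u0B95'), char32_t>::value,
              "U'...' has type char32_t");
static_assert(U'\u0B95' == 0x0B95, "value is the code point");

// Illustrative constant holding the character as a single UTF-16 code unit.
constexpr char16_t tamil_ka = u'\u0B95';
```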

So what's going on? Is this just a mess in the standard or am I missing something?


Answer 1:


I would argue as follows:

The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for literals with no prefix)... (From section 2.14.3.4)

If '\u0B95' falls outside the implementation-defined range for char (which it would if char is 8 bits), its value is implementation-defined. At that point GCC is free to treat it as a sequence of multiple c-chars, which makes it a multicharacter literal.
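As a rough model of what "a sequence of multiple c-chars" could evaluate to: as I understand GCC's preprocessor documentation, a multicharacter constant is evaluated one character at a time, shifting the previous value left by the width of a character and OR-ing in the next one. The function below is a hypothetical sketch of that rule, assuming 8-bit characters and a UTF-8 execution character set. It is not a statement of what any particular compiler version produces for this literal.

```cpp
#include <cstdint>
#include <initializer_list>

// Hypothetical model of the documented packing rule: shift the running value
// left by 8 bits (assumed character width) and OR in the next byte.
inline std::uint32_t pack_multichar(std::initializer_list<unsigned char> bytes) {
    std::uint32_t value = 0;
    for (unsigned char b : bytes) {
        value = (value << 8) | b;
    }
    return value;
}
```

U+0B95 encodes in UTF-8 as the bytes E0 AE 95, so under this model `pack_multichar({0xE0, 0xAE, 0x95})` gives 0xE0AE95.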




Answer 2:


Somebody posted an answer that correctly answered the second part of my question (what value will the char have?) but has since deleted their post. Since that part was correct, I'll reproduce it here together with my answer for the first part (is it a multicharacter literal?).


'\u0B95' is not a multicharacter literal and gcc is mistaken here. As stated in the question, a multicharacter literal is defined by (§2.14.3/1):

An ordinary character literal that contains more than one c-char is a multicharacter literal.

Since a universal-character-name is one expansion of c-char, the literal '\u0B95' contains only one c-char. Treating \u0B95 as six separate characters (\, u, 0, etc.) would only make sense if ordinary literals could not contain a universal-character-name, but I cannot find such a restriction anywhere. Therefore, it is a single character and the literal is not a multicharacter literal.
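The distinction between the two kinds of literal shows up in their types, which can be checked at compile time. This is a small sketch; the helper `has_type_char` is a made-up name, and compilers typically emit a multichar warning for `'ab'`, which is expected here.

```cpp
#include <type_traits>

// Hypothetical helper: true when the argument's type is exactly char.
template <typename T>
constexpr bool has_type_char(T) {
    return std::is_same<T, char>::value;
}

// One c-char: an ordinary character literal of type char.
static_assert(has_type_char('a'), "single c-char literal has type char");

// Two c-chars: a multicharacter literal, which is conditionally-supported,
// has type int, and has an implementation-defined value.
static_assert(std::is_same<decltype('ab'), int>::value,
              "multicharacter literal has type int");
```

By the wording quoted above, '\u0B95' should fall into the first group.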

To further support this, why would it be considered to be multiple characters? At this point we haven't even given it an encoding so we don't know how many bytes it would take up. In UTF-16 it would take 2 bytes, in UTF-8 it would take 3 bytes and in some imagined encoding it could take just 1 byte.
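These byte counts can be checked with string literals, whose encodings are fixed by their prefix. This is a sketch with illustrative constant names; each size subtracts the terminating null code unit.

```cpp
#include <cstddef>

// Bytes used by U+0B95 in each Unicode encoding form, excluding the
// terminating null code unit that every string literal carries.
constexpr std::size_t utf8_bytes  = sizeof(u8"\u0B95") - 1;
constexpr std::size_t utf16_bytes = sizeof(u"\u0B95") - sizeof(char16_t);
constexpr std::size_t utf32_bytes = sizeof(U"\u0B95") - sizeof(char32_t);

// UTF-8 needs three code units for this character (E0 AE 95).
static_assert(utf8_bytes == 3, "three UTF-8 code units");
// UTF-16 and UTF-32 each need exactly one code unit.
static_assert(utf16_bytes == sizeof(char16_t), "one UTF-16 code unit");
static_assert(utf32_bytes == sizeof(char32_t), "one UTF-32 code unit");
```

On a platform with 8-bit bytes and a 16-bit `char16_t`, that is 3, 2 and 4 bytes respectively, which matches the point that the byte count depends on the encoding chosen and not on the literal itself.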

So what value will the character literal have? First, the universal-character-name is mapped to the corresponding encoding in the execution character set; if it has no mapping there, it gets an implementation-defined encoding (§2.14.3/5):

A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.

Either way, the char literal gets the value equal to the numerical value of the encoding (§2.14.3/1):

An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set.

Now the important part, inconveniently tucked away in a different paragraph further into the section. If the value cannot be represented in a char, it gets an implementation-defined value (§2.14.3/4):

The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for literals with no prefix) ...




Answer 3:


You are correct, according to the spec '\u0B95' is a char-typed character literal with a value equal to the character's encoding in the execution character set. And you're right that the spec doesn't say anything about the case where this is not possible for char literals due to a single char being unable to represent that value. The behavior is undefined.

There are defect reports filed with the committee on this issue: E.g., http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html#912

The currently proposed resolution seems to be to specify that these character literals are also ints and have implementation-defined values (although the proposed language isn't quite right for that), just like multichar literals. I'm not a fan of that solution, and I think a better one is to say such literals are ill-formed.

This is what's implemented in clang: http://coliru.stacked-crooked.com/a/952ce7775dcf7472




Answer 4:


Because you have no character encoding prefix, gcc (and any other conforming compiler) will see '\u0B95' and think 1) char type and 2) multicharacter, because there is more than one char code in the string.

  • u'\u0B95' is a UTF-16 character.
  • u'\u0B95\u0B97' is a multicharacter UTF-16 character.
  • U'\ufacebeef' is a UTF-32 character.

etc.



Source: https://stackoverflow.com/questions/13547368/is-u0b95-a-multicharacter-literal