Question
I would like to store a character (in order to compare it with other characters).
If I declare the variable like this:
char c = 'é';
everything works well, but I get these warnings:
warning: multi-character character constant [-Wmultichar]
char c = 'é';
^
ii.c:12:3: warning: overflow in implicit constant conversion [-Woverflow]
char c = 'é';
I think I understand why these warnings appear, but why does it still work?
Should I instead define it like this: int d = 'é';
even though it takes more memory?
Moreover, I also get the warning below with that declaration:
warning: multi-character character constant [-Wmultichar]
int d = 'é';
Am I missing something? Thanks ;)
Answer 1:
é has the Unicode code point 0xE9; its UTF-8 encoding is "\xc3\xa9".
I assume your source file is encoded in UTF-8, so
char c = 'é';
is (roughly) equivalent to
char c = '\xc3\xa9';
How such character constants are treated is implementation-defined. For GCC:
The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed, regardless of whether single characters are signed or not (a slight change from versions 3.1 and earlier of GCC). If there are more characters in the constant than would fit in the target int the compiler issues a warning, and the excess leading characters are ignored.
For example, 'ab' for a target with an 8-bit char would be interpreted as (int) ((unsigned char) 'a' * 256 + (unsigned char) 'b'), and '\234a' as (int) ((unsigned char) '\234' * 256 + (unsigned char) 'a').
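The quoted rule can be sketched as a small function. This is only an illustration, assuming an 8-bit char and a 32-bit int; multichar_value is a hypothetical helper name, not something GCC provides.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper (not part of GCC): folds the bytes of a
   multi-character constant the way the quoted documentation describes.
   Assumes an 8-bit char and a 32-bit int. */
int multichar_value(const unsigned char *bytes, size_t n)
{
    unsigned int value = 0;
    for (size_t i = 0; i < n; i++) {
        /* Shift the previous value left by one target character,
           then OR in the bit pattern of the new character. */
        value = (value << CHAR_BIT) | bytes[i];
    }
    /* The final bit pattern is given type int. */
    return (int)value;
}
```

Called with the two bytes 0xC3 and 0xA9 (the UTF-8 encoding of é), this sketch returns 0xC3A9, i.e. 50089.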
Hence, 'é' has the value 0xC3A9, which fits into an int (at least for 32-bit int), but not into an (8-bit) char, so the conversion to char is again implementation-defined:
For conversion to a type of width N, the value is reduced modulo 2^N to be within range of the type; no signal is raised.
This gives (with signed char)
#include <stdio.h>
int main(void) {
printf("%d %d\n", 'é', (char)'é');
if((char)'é' == (char)'©') puts("(char)'é' == (char)'©'");
}
Output:
50089 -87
(char)'é' == (char)'©'
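50089 is 0xC3A9, and -87 is 0xA9 (169) interpreted as a signed 8-bit value. © is U+00A9, encoded in UTF-8 as "\xc2\xa9", so it ends in the same byte 0xA9 and converts to the same char.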
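The truncation can be reproduced without depending on the source file's encoding. The following is a minimal sketch, assuming an 8-bit signed char and a non-negative input; low_byte_as_signed is a hypothetical helper name.

```c
/* Hypothetical helper: keeps only the low 8 bits of a multi-character
   constant's value and reinterprets them as a signed 8-bit number,
   mirroring the modulo 2^N reduction for N = 8.
   Assumes value is non-negative. */
int low_byte_as_signed(int value)
{
    int low = value & 0xFF;               /* reduce modulo 2^8 */
    return low >= 128 ? low - 256 : low;  /* map 128..255 to -128..-1 */
}
```

Both 0xC3A9 (the bytes of é) and 0xC2A9 (the bytes of ©) come out as -87 here, which is why the two compare equal once stored in a char.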
So you lose information when storing é into a char (there are characters like © which compare equal to é). You can:
- Use wchar_t, an implementation-dependent wide character type which is 4 bytes on Linux and holds UTF-32: wchar_t c = L'é';. You can convert wide characters to the locale-specific multibyte encoding (probably UTF-8, but you'll need to set the locale first, see setlocale; note that changing the locale may change the behaviour of functions like isalpha or printf) with wcrtomb, or use them directly and also use wide strings (use the L prefix to get wide character string literals).
- Use a string and store UTF-8 in it (as in const char *c = "é"; or const char *c = "\u00e9"; or const char *c = "\xc3\xa9";, with possibly different semantics; for C11, perhaps also look for UTF-8 string literals and the u8 prefix).
Note that file streams have an orientation (cf. fwide).
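For the string-based option, comparing characters becomes a byte-wise string comparison. Here is a rough sketch, with the UTF-8 bytes written as hex escapes so the result does not depend on the source encoding; same_utf8 and starts_with_char are hypothetical helper names, and the comparison looks at bytes only.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when two NUL-terminated UTF-8 strings
   consist of exactly the same bytes. */
bool same_utf8(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Hypothetical helper: true when the UTF-8 text s begins with the
   (possibly multi-byte) character encoded in ch. */
bool starts_with_char(const char *s, const char *ch)
{
    return strncmp(s, ch, strlen(ch)) == 0;
}
```

With this approach, same_utf8("\xc3\xa9", "\xc2\xa9") is false, so é and © no longer collide the way they do after truncation to a single char.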
HTH
Answer 2:
Try using wchar_t rather than char. char is a single byte, which is appropriate for ASCII but not for multi-byte character sets such as UTF-8. Also, flag your character literal as being a wide character rather than a narrow character:
#include <wchar.h>
...
wchar_t c = L'é';
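Once the character is a wchar_t, it can be compared or searched for with the wide-string functions from <wchar.h>. A small sketch follows, assuming a compiler that accepts universal character names in wide literals (C99 and later); wide_contains is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: true when the wide string s contains
   the wide character c. */
bool wide_contains(const wchar_t *s, wchar_t c)
{
    return wcschr(s, c) != NULL;
}
```

For example, wide_contains(L"caf\u00e9", L'\u00e9') is true, while searching the same string for L'\u00a9' (©) is false, because each wide character keeps its full code point.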
Source: https://stackoverflow.com/questions/25100370/non-ascii-character-declaration