non-ASCII character declaration

Question

I would like to store a character (in order to compare it with other characters).

If I declare the variable like this:

char c = 'é';

everything works well, but I get these warnings:

warning: multi-character character constant [-Wmultichar]
   char c = 'é';
            ^
ii.c:12:3: warning: overflow in implicit constant conversion [-Woverflow]
   char c = 'é';

I think I understand why these warnings appear, but why does it still work? Should I declare it as int d = 'é'; instead, even though it takes more space in memory? I also get the warning below with that declaration:

warning: multi-character character constant [-Wmultichar]

int d = 'é';

Am I missing something? Thanks ;)

Answer 1:

é has the Unicode code point U+00E9; its UTF-8 encoding is the two bytes "\xc3\xa9".

I assume your source file is encoded in UTF-8, so

char c = 'é';

is (roughly) equivalent to

char c = '\xc3\xa9';

How such character constants are treated is implementation-defined. For GCC:

The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed, regardless of whether single characters are signed or not (a slight change from versions 3.1 and earlier of GCC). If there are more characters in the constant than would fit in the target int the compiler issues a warning, and the excess leading characters are ignored.

For example, 'ab' for a target with an 8-bit char would be interpreted as (int) ((unsigned char) 'a' * 256 + (unsigned char) 'b'), and '\234a' as (int) ((unsigned char) '\234' * 256 + (unsigned char) 'a').
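To make the quoted rule concrete, here is a minimal sketch that folds the bytes by hand in the way the GCC documentation describes. The helper name multichar_value is hypothetical, and the sketch assumes an 8-bit char and a 32-bit int; it returns unsigned int for simplicity, whereas GCC gives the constant type int.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC: combines bytes the way the quoted
 * rule describes, shifting the previous value left by 8 bits and OR-ing in
 * the next byte. Assumes 8-bit bytes and no more bytes than fit in an int. */
unsigned int multichar_value(const unsigned char *bytes, size_t n)
{
    unsigned int value = 0;

    assert(n <= sizeof value);  /* excess bytes would not fit */
    for (size_t i = 0; i < n; i++) {
        value = (value << 8) | bytes[i];
    }
    return value;
}
```

Called on the two UTF-8 bytes of é, 0xC3 and 0xA9, it returns 0xC3A9, which is 50089 in decimal.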

Hence, 'é' has the value 0xC3A9, which fits into an int (at least for 32-bit int), but not into an (8-bit) char, so the conversion to char is again implementation-defined:

For conversion to a type of width N, the value is reduced modulo 2^N to be within range of the type; no signal is raised.
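The reduction can be worked through by hand. The sketch below is only an illustration of the quoted rule for N = 8 on a two's complement target; the helper name reduce_to_signed_8bit is hypothetical, and the arithmetic is spelled out explicitly rather than relying on the compiler's own conversion.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: reduces value modulo 2^8 and maps the result into
 * the range of an 8-bit two's complement type (-128..127). */
int reduce_to_signed_8bit(unsigned int value)
{
    unsigned int low;

    assert(CHAR_BIT == 8);      /* the sketch assumes an 8-bit char */
    low = value % 256u;         /* keep only the low 8 bits */
    return low >= 128u ? (int)low - 256 : (int)low;
}
```

For 0xC3A9 the low byte is 0xA9 (169), and 169 - 256 gives -87. The same result comes out for 0xC2A9, the multi-character value of ©, because only the low byte survives.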

This gives (with signed char)

#include <stdio.h>
int main(void) {
    printf("%d %d\n", 'é', (char)'é');
    if((char)'é' == (char)'©') puts("(char)'é' == (char)'©'");
}

Output:

50089 -87
(char)'é' == (char)'©'

50089 is 0xC3A9, and -87 is 0xA9 read as a signed 8-bit value (0xA9 is 169, and 169 - 256 = -87).

So you lose information when storing é in a char: only the last byte, 0xA9, survives, so other characters whose UTF-8 encoding ends in that byte, such as © ("\xc2\xa9"), compare equal to é. You can

Use wchar_t, an implementation-defined wide character type that is 4 bytes on Linux and holds UTF-32: wchar_t c = L'é';. You can convert wide characters to the locale-specific multibyte encoding with wcrtomb (probably UTF-8, but you need to set the locale first, see setlocale; note that changing the locale may change the behaviour of functions like isalpha or printf), or use them directly together with wide strings (the L prefix gives wide character string literals).
Use a string and store UTF-8 in it, as in const char *c = "é"; or const char *c = "\u00e9"; or const char *c = "\xc3\xa9";, with possibly different semantics. For C11, also look at UTF-8 string literals and the u8 prefix.
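As a rough sketch of the string option, the comparison can be done byte-wise with strcmp. The helper name utf8_equal is hypothetical, and the sketch compares raw bytes only, so it does not perform any Unicode normalization. The literals in the usage note are written with \x escapes so the result does not depend on the encoding of the source file.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if two NUL-terminated UTF-8 strings
 * consist of exactly the same bytes, 0 otherwise. */
int utf8_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

With this, utf8_equal("\xc3\xa9", "\xc2\xa9") is 0, so é and © are no longer confused, and strlen("\xc3\xa9") is 2 because é occupies two bytes.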

Note that file streams have an orientation (cf. fwide), so narrow and wide I/O functions should not be mixed on the same stream.

HTH

Answer 2:

Try using wchar_t rather than char. A char is a single byte, which can hold an ASCII character but not a character that takes several bytes in an encoding such as UTF-8. Also, mark your character literal as a wide character rather than a narrow one:

#include <wchar.h>
...
wchar_t c = L'é';
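A small sketch of comparing wide characters follows. The helper name count_wchar is hypothetical; the usage note writes the characters as universal character names (\u00e9 for é, \u00a9 for ©) so that the result does not depend on how the source file is encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: counts how many times target occurs in the
 * NUL-terminated wide string s. */
size_t count_wchar(const wchar_t *s, wchar_t target)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != L'\0'; s++) {
        if (*s == target) {
            count++;
        }
    }
    return count;
}
```

For the wide string L"\u00e9t\u00e9 \u00a9", counting L'\u00e9' gives 2 and counting L'\u00a9' gives 1: unlike the truncated char values, the two wide characters are distinct.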

Source: https://stackoverflow.com/questions/25100370/non-ascii-character-declaration

Tags

character

special-characters