Question
Can someone explain clearly what these lines from K&R actually mean:
"When a char is converted to an int, can it ever produce a negative integer? The answer varies from machine to machine. The definition of C guarantees that any character in the machine's standard printing character set will never be negative, but arbitrary bit patterns stored in character variables may appear to be negative on some machines,yet positive on others".
Answer 1:
You need to understand several things first.
If I take an 8-bit value and extend it to a 16-bit value, normally you would imagine just adding a bunch of 0's on the left. For example, if I have the 8-bit value 23, in binary that's 00010111, so as a 16-bit number it's 0000000000010111, which is also 23.
It turns out that negative numbers always have a 1 in the high-order bit. (There might be weird machines for which this is not true, but it's true for any machine you're likely to use.) For example, the 8-bit value -40 is represented in binary as 11011000.
So when you convert a signed 8-bit value to a 16-bit value, if the high-order bit is 1 (that is, if the number is negative), you do not add a bunch of 0-s on the left, you add a bunch of 1's instead. For example, going back to -40, we would convert 11011000 to 1111111111011000, which is the 16-bit representation of -40.
There are also unsigned numbers, that are never negative. For example, the 8-bit unsigned number 216 is represented as 11011000. (You will notice that this is the same bit pattern as the signed number -40 had.) When you convert an unsigned 8-bit number to 16 bits, you add a bunch of 0's no matter what. For example, you would convert 11011000 to 0000000011011000, which is the 16-bit representation of 216.
So, putting this all together, if you're converting an 8-bit number to 16 (or more) bits, you have to look at two things. First, is the number signed or unsigned? If it's unsigned, just add a bunch of 0's on the left. But if it's signed, you have to look at the high-order bit of the 8-bit number. If it's 0 (if the number is positive), add a bunch of 0's on the left. But if it's 1 (if the number is negative), add a bunch of 1's on the left instead. (This whole process is known as sign extension.)
The ordinary ASCII characters (like 'A' and '1' and '$') all have values less than 128, which means that their high-order bit is always 0. But "special" characters from the "Latin-1" character set (and the individual bytes of multi-byte UTF-8 sequences) have values of 128 or greater. For this reason they're sometimes also called "high bit" or "eighth bit" characters. For example, the Latin-1 character Ø (O with a slash through it) has the value 216.
Finally, although type char in C is typically an 8-bit type, the C Standard does not specify whether it is signed or unsigned.
Putting this all together, what Kernighan and Ritchie are saying is that when we convert a char to a 16- or 32-bit integer, we don't quite know whether to sign-extend. If I'm on a machine where type char is unsigned, and I take the character Ø and convert it to an int, I'll probably get the value 216. But if I'm on a machine where type char is signed, I'll probably get the number -40.
Answer 2:
There are two more-or-less relevant parts to the standard — ISO/IEC 9899:2011.
6.2.5 Types
¶3 An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

¶15 The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.45)

45) CHAR_MIN, defined in <limits.h>, will have one of the values 0 or SCHAR_MIN, and this can be used to distinguish the two options. Irrespective of the choice made, char is a separate type from the other two and is not compatible with either.
That defines what your quote from K&R states. The other relevant part defines what the basic execution character set is.
5.2.1 Character sets
¶1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
¶2 In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.

¶3 Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and form feed. The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.

¶4 A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.
¶5 The universal character name construct provides a way to name other characters.
One consequence of these rules is that if a machine uses 8-bit characters and the EBCDIC encoding, then plain char must be an unsigned type, since the digits have codes 240..249 in EBCDIC and ¶3 of 6.2.5 requires members of the basic execution character set to be nonnegative.
Source: https://stackoverflow.com/questions/44077321/subtlety-in-conversion-of-characters-to-integers