Question
I have a file that I've read into an array of data type signed char. I cannot change this fact.
I would now like to do this: !((c[i] & 0xc0) & 0x80) where c[i] is one of the signed characters.
Now, I know from section 6.5.10 of the C99 standard that "Each of the operands [of the bitwise AND] shall have integral type."
And Section 6.5 of the C99 specification tells me:
Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) shall have operands that have integral type. These operators return values that depend on the internal representations of integers, and thus have implementation-defined aspects for signed types.
My question is two-fold:
- Since I want to work with the original bit patterns from the file, how can I convert/cast my signed char to unsigned char so that the bit patterns remain unchanged?
- Is there a list of these "implementation-defined aspects" anywhere (say for MSVC and GCC)?
Or you could take a different route and argue that this produces the same result for both signed and unsigned chars for any value of c[i].
Naturally, I will reward references to relevant standards or authoritative texts and discourage "informed" speculation.
Answer 1:
As others point out, in all likelihood your implementation is based on two's complement, and will give exactly the result you expect.
However, if you're worried about the results of an operation involving a signed value, and all you care about is the bit pattern, simply cast directly to an equivalent unsigned type. The results are defined under the standard:
6.3.1.3 Signed and unsigned integers
...
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
This is essentially specifying that the result will be the two's complement representation of the value.
Fundamental to this is that in two's complement maths the result of a calculation is taken modulo a power of two (2^N, where N is the number of bits in the type), which in turn is exactly equivalent to masking off all but the low N bits. And the two's complement of a number is that number subtracted from 2^N.
Thus adding a negative value is the same as adding any value which differs from the value by a multiple of that power of two.
i.e:
(0 + signed_value) mod (2^N)
==
(2^N + signed_value) mod (2^N)
==
(7 * 2^N + signed_value) mod (2^N)
etc. (if you know modulo, that should be pretty self-evidently true)
So if you have a negative number, adding a power of two will make it positive (-5 + 256 = 251), but the bottom 'N' bits will be exactly the same (0b11111011) and it will not affect the outcome of a mathematical operation. As values are then truncated to fit the type, the result is exactly the binary value you expected, even if the result 'overflows' (i.e. what you might think happens if the number was positive to start with; this wrapping is also well-defined behaviour).
So in 8-bit two's complement:
- -5 is the same as 251 (i.e 256 - 5) - 0b11111011
- If you add 30 and 251, you get 281. But that's larger than 255, and 281 mod 256 equals 25. Exactly the same as 30 - 5.
- 251 * 2 = 502. 502 mod 256 = 246. 246 and -10 are both 0b11110110.
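The arithmetic in the list above can be checked with a small sketch. The helper names to_unsigned and low8 are made up for illustration, and the numbers assume an 8-bit char (CHAR_BIT == 8):

```c
/* Convert a signed char to unsigned char by value.
   C99 6.3.1.3 defines the result as the value reduced modulo UCHAR_MAX + 1,
   so with an 8-bit char, -5 becomes 251. */
unsigned char to_unsigned(signed char value)
{
    return (unsigned char)value;
}

/* Keep only the low 8 bits of a result, i.e. reduce it modulo 256. */
unsigned low8(unsigned x)
{
    return x & 0xFFu;
}
```

For example, low8(30u + to_unsigned(-5)) gives 25, and low8(2u * to_unsigned(-5)) gives the same value as to_unsigned(-10).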
Likewise if you have:
unsigned int a;
int b;
a - b == a + (unsigned int) -b;
Under the hood, this cast is unlikely to be implemented with arithmetic and will almost certainly be a straight assignment from one register/value to another, or just optimised out altogether, as the maths does not make a distinction between signed and unsigned (interpretation of CPU flags is another matter, but that's an implementation detail). The standard exists to ensure that an implementation doesn't take it upon itself to do something strange instead, or I suppose, for some weird architecture which isn't using two's complement...
Answer 2:
unsigned char UC = *(unsigned char*)&C - this is how you can convert signed C to unsigned keeping the "bit pattern". Thus you could change your code to something like this:
!(( (*(unsigned char*)(c+i)) & 0xc0) & 0x80)
Explanation (with references):
761 When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object.
1124 When applied to an operand that has type char, unsigned char, or signed char, (or a qualified version thereof) the result is 1.
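Together these two imply that the unsigned char pointer points to the same byte as the original signed char pointer: the first says the converted pointer addresses the lowest byte of the object, and the second (which describes the sizeof operator) says a signed char occupies exactly one byte.
Answer 3: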
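As a rough sketch of the pointer approach, here it is next to a memcpy variant that copies the same byte without a pointer cast. The function names are hypothetical, not from any library:

```c
#include <string.h>

/* Read the stored byte through an unsigned char pointer
   (the pointer-cast approach described in this answer). */
unsigned char byte_via_pointer(const signed char *p)
{
    return *(const unsigned char *)p;
}

/* Copy the object representation instead; this reads the same single byte. */
unsigned char byte_via_memcpy(const signed char *p)
{
    unsigned char out;
    memcpy(&out, p, sizeof out);
    return out;
}
```

Both functions return the stored bit pattern. On a two's complement machine with 8-bit chars, I'd expect a signed char holding -64 to come back as 0xC0 from either one, which is also what the value cast (unsigned char)c produces there.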
You appear to have something similar to:
signed char c[] = "\x7F\x80\xBF\xC0\xC1\xFF";
for (int i = 0; c[i] != '\0'; i++)
{
if (!((c[i] & 0xC0) & 0x80))
...
}
You are (correctly) concerned about sign extension of the signed char type. In practice, however, (c[i] & 0xC0) will convert the signed character to a (signed) int, but the & 0xC0 will discard any set bits in the more significant bytes; the result of the expression will be in the range 0x00 .. 0xFF. This will, I believe, apply whether you use sign-and-magnitude, one's complement or two's complement binary values. The detailed bit pattern you get for a specific signed character value varies depending on the underlying representation; but the overall conclusion that the result will be in the range 0x00 .. 0xFF is valid.
There is an easy resolution for that concern — cast the value of c[i] to an unsigned char before using it:
if (!(((unsigned char)c[i] & 0xC0) & 0x80))
The value c[i] is converted to an unsigned char before it is promoted to an int (or, the compiler might promote to int, then coerce to unsigned char, then promote the unsigned char back to int), and the unsigned value is used in the & operations.
Of course, the code is now merely redundant. Using & 0xC0 followed by & 0x80 is entirely equivalent to just & 0x80.
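If you want to convince yourself on your own compiler, a brute-force check over every signed char value is cheap. This is only a sketch (the function name is invented), and the three forms are only expected to agree on a two's complement implementation:

```c
#include <limits.h>

/* Returns 1 if the original test, the test with the cast, and the
   shortened test give the same answer for every signed char value;
   returns 0 as soon as any value disagrees. */
int forms_agree_for_all_values(void)
{
    for (int v = SCHAR_MIN; v <= SCHAR_MAX; v++) {
        signed char c = (signed char)v;
        int original  = !((c & 0xC0) & 0x80);
        int with_cast = !(((unsigned char)c & 0xC0) & 0x80);
        int shortened = !((unsigned char)c & 0x80);
        if (original != with_cast || with_cast != shortened)
            return 0;
    }
    return 1;
}
```

On a sign-and-magnitude machine the uncast form could disagree for some negative values, since promotion to int moves the sign bit out of the low byte, which is one more reason to keep the cast.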
If you're processing UTF-8 data and looking for continuation bytes, the correct test is:
if (((unsigned char)c[i] & 0xC0) == 0x80)
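To show that test in context, here is a minimal sketch that counts code points by skipping continuation bytes. The function names are mine, and it assumes the buffer is NUL-terminated and already valid UTF-8 (there is no validation):

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
int is_utf8_continuation(signed char c)
{
    return ((unsigned char)c & 0xC0) == 0x80;
}

/* Count code points by counting every byte that is not a continuation byte.
   Assumes valid, NUL-terminated UTF-8 input. */
size_t count_code_points(const signed char *s)
{
    size_t n = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (!is_utf8_continuation(s[i]))
            n++;
    }
    return n;
}
```

For the six bytes 'h', 0xC3, 0xA9, 'l', 'l', 'o' (that is, "héllo") this counts 5, because 0xA9 is the only continuation byte.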
Answer 4:
"Since I want to work with the original bit patterns from the file, how can I convert/cast my signed char to unsigned char so that the bit patterns remain unchanged?"
As someone already explained in a previous answer to your question on the same topic, any small integer type, be it signed or unsigned, will get promoted to the type int whenever used in an expression.
C11 6.3.1.1
"If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions."
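Also, as explained in the same answer, integer constants such as 0xC0 and 0x80 are of type int (an unsuffixed constant takes the first type in its list, starting with int, that can represent its value).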
Therefore, your expression will boil down to the pseudo code (int) & (int) & (int). The operations will be performed on three temporary int variables and the result will be of type int.
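One way to watch the promotion happen is C11's _Generic, which selects on the type of an expression without evaluating it. The macro and function names below are purely illustrative, and this needs a C11 compiler:

```c
/* Map an expression to a string naming its type (C11 _Generic). */
#define TYPE_NAME(expr) _Generic((expr), \
    signed char:   "signed char",        \
    unsigned char: "unsigned char",      \
    int:           "int",                \
    unsigned int:  "unsigned int",       \
    default:       "other")

/* The operand on its own still has type signed char. */
const char *type_before_promotion(signed char c)
{
    return TYPE_NAME(c);
}

/* Once it takes part in &, it has been promoted to int. */
const char *type_after_and(signed char c)
{
    return TYPE_NAME(c & 0xC0);
}

/* With a u-suffixed mask, the whole expression becomes unsigned int. */
const char *type_with_u_suffix(signed char c)
{
    return TYPE_NAME(c & 0xC0u);
}
```

On a typical implementation where int is wider than char, these should return "signed char", "int" and "unsigned int" respectively.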
Now, if the original data contained bits that may be interpreted as sign bits for the specific signedness representation (in practice this will be two's complement on all systems), you will get problems, because these bits will be preserved upon promotion from signed char to int.
And then the bit-wise & operator performs an AND on every single bit regardless of the contents of its integer operand (C11 6.5.10/3), be it signed or not. If you had data in the sign bits of your original signed char, it will now be lost, because the integer literals (0xC0 or 0x80) have no bits set that correspond to the sign bits.
The solution is to prevent the sign bits from getting transferred to the "temporary int". One solution is to cast c[i] to unsigned char, which is completely well-defined (C11 6.3.1.3). This will tell the compiler that "the whole contents of this variable is an integer, there are no sign bits to be concerned about".
Better yet, make a habit of always using unsigned data in every form of bit manipulations. The purist, 100% safe, MISRA-C compliant way of re-writing your expression is this:
if ( (((uint8_t)c[i] & 0xc0u) & 0x80u) > 0u)
The u suffix actually forces the expression to be evaluated as unsigned int, but it is good practice to always cast to the intended type. It tells the reader of the code "I actually know what I am doing and I also understand all weird implicit promotion rules in C".
And then if we know our hex, (0xc0 & 0x80) is pointless: it always evaluates to 0x80. And x & 0xC0 & 0x80 is always the same as x & 0x80. Therefore simplify the expression to:
if ( ((uint8_t)c[i] & 0x80u) > 0u)
"Is there a list of these "implementation-defined aspects" anywhere"
Yes, the C standard conveniently lists them in Annex J.3. The only implementation-defined aspect you encounter in this case, though, is the signedness representation of integers, which in practice is always two's complement.
EDIT: The quoted text in the question says that the various bit-wise operators have implementation-defined aspects. Even in the annex this is only briefly mentioned, with no exact references. The actual chapter 6.5 doesn't say much regarding implementation-defined behavior of & | etc. The only operators where it is explicitly mentioned are << and >>, where left shifting a negative number is even undefined behavior, but right shifting it is implementation-defined.
Source: https://stackoverflow.com/questions/14233716/bitwise-and-on-signed-chars