sign extension in C | 易学教程

Question

I'm looking here to understand sign extension: http://www.shrubbery.net/solaris9ab/SUNWdev/SOL64TRANS/p8.html

    #include <stdio.h>

    struct foo {
        unsigned int    base:19, rehash:13;
    };

    int main(int argc, char *argv[])
    {
        struct foo  a;
        unsigned long addr;

        a.base = 0x40000;
        addr = a.base << 13;        /* Sign extension here! */
        printf("addr 0x%lx\n", addr);

        addr = (unsigned int)(a.base << 13);  /* No sign extension here! */
        printf("addr 0x%lx\n", addr);
        return 0;
    }

They claim this:

------------------ 64 bit:

% cc -o test64 -xarch=v9 test.c
% ./test64
addr 0xffffffff80000000
addr 0x80000000
%

------------------ 32 bit:

% cc -o test32 test.c
% ./test32
addr 0x80000000
addr 0x80000000
%

I have 3 questions:

1. What is sign extension? Yes, I read the wiki, but I didn't understand what happens with sign extension when type promotion occurs.
2. Why the ffff... in 64-bit (referring to addr)?
3. When I do a type cast, why is there no sign extension?

EDIT: 4. Why is this not an issue on a 32-bit system?

Answer 1:

a.base << 13

The bitwise shift operator performs the integer promotions on each of its operands.

So this is equivalent to:

    (int) a.base << 13

which is a negative value of type int.

Then:

addr = (int) a.base << 13;

converts this signed negative value ((int) a.base << 13) to the type of addr which is unsigned long through integer conversions.

By the integer conversion rules (C99, 6.3.1.3p2), this is the same as doing:

addr = (long) ((int) a.base << 13);

The conversion to long performs the sign extension here because ((int) a.base << 13) is a negative signed number.

In the other case, with a cast, you have something equivalent to:

addr = (unsigned long) (unsigned int) ((int) a.base << 13);

so no sign extension is performed in your second case because (unsigned int) ((int) a.base << 13) is an unsigned (and positive of course) value.

EDIT: as KerrekSB mentioned in his answer, a.base << 13 is actually not representable in an int (I assume a 32-bit int), so this expression invokes undefined behavior and the implementation has the right to behave in any other way, for example by crashing.

For information, this is definitely not portable, but if you are using gcc, it does not treat a.base << 13 here as undefined behavior. From the gcc documentation:

"GCC does not use the latitude given in C99 only to treat certain aspects of signed '<<' as undefined, but this is subject to change."

in http://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html

Answer 2:

The left operand of the << operator undergoes the standard promotions, so in your case it is promoted to int -- so far so good. Next, the int of value 0x40000 is multiplied by 2¹³, which causes overflow and thus undefined behaviour. However, we can see what's happening: the value of the expression is now simply INT_MIN, the smallest representable int. Finally, when you convert that to an unsigned 64-bit integer, the usual modular arithmetic rules entail that the resulting value is 0xffffffff80000000. Similarly, converting to an unsigned 32-bit integer gives the value 0x80000000.

To perform the operation on unsigned values, you need to control the conversions with a cast:

(unsigned int)(a.base) << 13

Answer 3:

This is more of a question about bit-fields. Note that if you change the struct to

struct foo {
    unsigned int    base, rehash;  
};

you get very different results.

As @JensGustedt noted in "Type of unsigned bit-fields: int or unsigned int", the specification says:

If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int;

Even though you've specified that base is unsigned, the compiler converts it to a signed int when you read it. That's why the shifted value is sign-extended in the first assignment, and why you don't get sign extension when you cast it to unsigned int.

Sign extension has to do with how negative numbers are represented in binary. The most common scheme is two's complement. In this scheme, -1 is represented in 32 bits as 0xFFFFFFFF, -2 is 0xFFFFFFFE, etc. So what should be done when we want to convert a 32-bit number to a 64-bit number, for example? If we convert 0xFFFFFFFF to 0x00000000FFFFFFFF, the numbers will have the same unsigned value (about 4 billion), but different signed values (-1 vs. 4 billion). On the other hand, if we convert 0xFFFFFFFF to 0xFFFFFFFFFFFFFFFF, the numbers will have the same signed value (-1) but different unsigned values. The former is called zero-extension (and is appropriate for unsigned numbers) and the latter is called sign-extension (and is appropriate for signed numbers). It's called "sign-extension" because the "sign bit" (the most significant, or left-most, bit) is extended, or copied, to make the number wider.

Answer 4:

It took me a while and a lot of reading/testing.
Maybe my beginner's way of understanding what's going on will work for you (as it did for me).

a.base=0x40000 (1(0)x18, i.e. a 1 followed by eighteen 0 bits) -> 19-bit bit-field
addr=a.base<<13.
- Any value a.base can hold, an int can hold too, so the 19-bit unsigned int bit-field is converted to a 32-bit signed integer. (a.base is now (0)x13,1,(0)x18.)
- Now (a.base converted to signed int)<<13 results in 1(0)x31. Remember, it's a signed int now.
- addr=(1(0)x31). addr is of type unsigned long (64 bits), so to do the assignment the right-hand value is converted to long int. The conversion from signed int to long int makes addr (1)x33,(0)x31.

And that's what is printed after all of those conversions you weren't even aware of: 0xffffffff80000000.
The second line prints 0x80000000 because of the cast to (unsigned int) before the conversion to long int. When converting unsigned int to long int there is no sign bit, so the value is just padded with leading 0's to match the size, and that's all.

What's different on 32-bit is that during the conversion from 32-bit signed int to 32-bit unsigned long their sizes match and no leading sign bits are added, so 1(0)x31 stays 1(0)x31,
even after the conversion from int to long int (they have the same size; the value is interpreted differently but the bits are intact).

Quotation from your link:

Any code that makes this assumption must be changed to work for both ILP32 and LP64. While an int and a long are both 32-bits in the ILP32 data model, in the LP64 data model, a long is 64-bits.

Source: https://stackoverflow.com/questions/19260052/sign-extension-in-c

Tags

bit