What happens if you cast a big int to float?

Question

This is a general question about what precisely happens when I cast a very large or very small signed integer to a floating-point value using gcc 4.4.

I see some weird behaviour when doing the cast. Here are some examples:

MUST BE (the result of the actual cast) is obtained with this method:

float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));
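For reference, the three lines above can be wrapped into a small helper so the compiler's result can be compared against a hand-written routine. This is only a sketch: the name cast_bits is made up here, and it assumes float is a 32-bit IEEE 754 single, as the rest of the question does.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern the compiler produces for (float)x.
 * Assumption: float is a 32-bit IEEE 754 single. cast_bits is a
 * hypothetical helper name, not part of the original test harness. */
uint32_t cast_bits(int32_t x)
{
    float f = (float)x;
    uint32_t r;
    memcpy(&r, &f, sizeof r);   /* copy the bytes instead of pointer casting */
    return r;
}
```

Calling cast_bits(0x3f7fffe0) should give the MUST BE pattern of the second example below, 0x4E7E0000.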

./btest -f float_i2f -1 0x80800001
input:          10000000100000000000000000000001
absolute value: 01111111011111111111111111111111

exponent:       10011101
mantissa:       00000000011111101111111111111111  (right shifted absolute value)

EXPECT:         11001110111111101111111111111111  (sign|exponent|mantissa)
MUST BE:        11001110111111110000000000000000  (sign ok, exponent ok,
                                                     mantissa???)

./btest -f float_i2f -1 0x3f7fffe0

EXPECT:    01001110011111011111111111111111
MUST BE:   01001110011111100000000000000000

./btest -f float_i2f -1 0x80004999

EXPECT:    11001110111111111111111101101100
MUST BE:   11001110111111111111111101101101    (<- 1 added at the end)

So what bothers me is that the mantissa in these examples is different from what I get if I just shift my integer value to the right. The zeros at the end, for instance: where do they come from?

I only see this behaviour on big/small values. Values in the range -2^24 to 2^24 work fine.

I wonder if someone can enlighten me about what happens here. What are the steps to take for very big/small values?

This is an add-on question to "function to convert float to int (huge integers)", which is not as general as this one.

EDIT Code:

unsigned float_i2f(int x) {
  if (x == 0) return 0;
  /* get sign of x */
  int sign = (x>>31) & 0x1;

  /* absolute value of x */
  int a = sign ? ~x + 1 : x;

  /* calculate exponent */
  int e = 158;
  int t = a;
  while (!(t >> 31) & 0x1) {
    t <<= 1;
    e--;
  };

  /* calculate mantissa */
  int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
  m &= 0x7fffff;

  int res = sign << 31;
  res |= (e << 23);
  res |= m;

  return res;
}

EDIT 2:

After Adam's remarks and the reference to the book Write Great Code, I updated my routine with rounding. I still get some rounding errors (now fortunately only 1 bit off).

Now if I do a test run, I get mostly good results but a couple of rounding errors like this:

input:  0xfefffff5
result: 11001011100000000000000000000101
GOAL:   11001011100000000000000000000110  (1 too low)

input:  0x7fffff
result: 01001010111111111111111111111111
GOAL:   01001010111111111111111111111110  (1 too high)

unsigned float_i2f(int x) {
  if (x == 0) return 0;
  /* get sign of x */
  int sign = (x>>31) & 0x1;

  /* absolute value of x */
  int a = sign ? ~x + 1 : x;

  /* calculate exponent */
  int e = 158;
  int t = a;
  while (!(t >> 31) & 0x1) {
    t <<= 1;
    e--;
  };

  /* mask to check which bits get shifted out when rounding */
  static unsigned masks[24] = {
    0, 1, 3, 7, 
    0xf, 0x1f, 
    0x3f, 0x7f, 
    0xff, 0x1ff, 
    0x3ff, 0x7ff, 
    0xfff, 0x1fff, 
    0x3fff, 0x7fff, 
    0xffff, 0x1ffff, 
    0x3ffff, 0x7ffff, 
    0xfffff, 0x1fffff, 
    0x3fffff, 0x7fffff
  };

  /* mask to check whether to round up or down */
  static unsigned HOmasks[24] = {
    0,
    1, 2, 4, 0x8, 0x10, 0x20, 0x40, 0x80,
    0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000
  };

  int S = a & masks[8];
  int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
  m &= 0x7fffff;

  if (S > HOmasks[8]) {
    /* round up */
    m += 1;
  } else if (S == HOmasks[8]) {
    /* round down */
    m = m + (m & 1);
  }

  /* special case where last bit of exponent is also set in mantissa
   * and mantissa itself is 0 */
  if (m & (0x1 << 23)) {
    e += 1;
    m = 0;
  }

  int res = sign << 31;
  res |= (e << 23);
  res |= m;
  return res;
}

Does someone have any idea where the problem lies?

Answer 1:

C/C++ floats tend to be compatible with the IEEE 754 floating point standard (e.g. in gcc). The zeros come from the rounding rules.

Shifting a number to the right makes some bits from the right-hand side go away. Let's call them guard bits. Now let's call HO the highest bit and LO the lowest bit of our number. Now suppose that the guard bits are still a part of our number. If, for example, we have 3 guard bits it means that the value of our LO bit is 8 (if it is set). Now if:

value of guard bits > 0.5 * value of LO
- round the number up to the smallest possible greater value, ignoring the sign
value of guard bits == 0.5 * value of LO
- use current number value if LO == 0
- number += 1 otherwise
value of guard bits < 0.5 * value of LO
- use current number value
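The three cases can be sketched as one function that shifts right and rounds to nearest, with ties going to the even value. The name shift_round_even is made up for illustration, and the shift count is assumed to be between 1 and 31.

```c
#include <stdint.h>

/* Shift n right by k bits (assumes 0 < k < 32), rounding to nearest
 * with ties to even. Hypothetical helper for illustration only. */
uint32_t shift_round_even(uint32_t n, unsigned k)
{
    uint32_t guard = n & ((1u << k) - 1u);  /* bits that get shifted out */
    uint32_t half  = 1u << (k - 1);         /* 0.5 * weight of the new LO bit */
    uint32_t q     = n >> k;                /* truncated result */

    if (guard > half)
        q += 1;             /* more than half: round up */
    else if (guard == half)
        q += q & 1u;        /* exact tie: add 1 only if LO is set */
    return q;               /* less than half: keep the truncated value */
}
```

For instance, shifting 15 right by 3 leaves guard bits 111 (7), which is more than half of 8, so the truncated result 1 is rounded up to 2.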

Why do 3 guard bits mean the LO value is 8?

Suppose we have a binary 8 bit number:

weights:    128 64 32 16 8 4 2 1
binary num:   0  0  0  0 1 1 1 1

Let's shift it right by 3 bits:

weights:      x x x 128 64 32 16 8 | 4 2 1
binary num:   0 0 0   0  0  0  0 1 | 1 1 1

As you can see, with 3 guard bits the LO bit ends up at the 4th position and has a weight of 8. This holds only for the purpose of rounding. The weights have to be 'normalized' afterwards, so that the weight of the LO bit becomes 1 again.

And how can I check with bit operations if guard bits > 0.5 * value?

The fastest way is to employ lookup tables. Suppose we're working on an 8 bit number:

unsigned number;          //our number
unsigned bitsToShift;     //number of bits to shift

assert(bitsToShift < 8);  //8 bits

unsigned guardMasks[8] = {0, 1, 3, 7, 0xf, 0x1f, 0x3f, 0x7f};
unsigned LOvalues[8] = {0, 1, 2, 4, 0x8, 0x10, 0x20, 0x40}; //LO weight divided by 2 for faster comparison

unsigned guardBits = number & guardMasks[bitsToShift]; //value of the guard bits
number = number >> bitsToShift;

if(guardBits > LOvalues[bitsToShift]) {
...
} else if (guardBits == LOvalues[bitsToShift]) {
...
} else { //guardBits < LOvalues[bitsToShift]
...
}
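Applied to the int-to-float conversion from the question, the same comparison could look like the sketch below. It is not the asker's final code: the name int_to_float_bits is made up, and it assumes a 32-bit int and IEEE 754 single precision. The notable difference from the second routine in the question is that the guard bits are taken from the normalized value t (its low 8 bits) rather than from the unshifted absolute value a, which appears to account for both off-by-one results reported there.

```c
#include <stdint.h>

/* Sketch: IEEE 754 single-precision bit pattern for a 32-bit int,
 * rounding to nearest with ties to even.
 * int_to_float_bits is a hypothetical name, not from the original post. */
uint32_t int_to_float_bits(int32_t x)
{
    if (x == 0)
        return 0;

    uint32_t sign = x < 0;
    /* magnitude computed in unsigned arithmetic so INT32_MIN is safe */
    uint32_t a = sign ? 0u - (uint32_t)x : (uint32_t)x;

    /* normalize: move the leading 1 to bit 31; 158 = 127 (bias) + 31 */
    uint32_t e = 158;
    uint32_t t = a;
    while (!(t & 0x80000000u)) {
        t <<= 1;
        e--;
    }

    uint32_t m     = (t >> 8) & 0x7fffffu;  /* 23 stored mantissa bits */
    uint32_t guard = t & 0xffu;             /* the 8 bits shifted out of t */

    if (guard > 0x80u)
        m += 1;             /* above half: round up */
    else if (guard == 0x80u)
        m += m & 1u;        /* tie: round to even */

    if (m & 0x800000u) {    /* rounding carried out of the mantissa */
        m = 0;
        e += 1;
    }
    return (sign << 31) | (e << 23) | m;
}
```

With this version the two failing inputs from the question, 0xfefffff5 and 0x7fffff, produce the GOAL patterns 0xCB800006 and 0x4AFFFFFE.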

Reference: Write Great Code, Volume 1 by Randall Hyde

Answer 2:

A 32-bit float uses some of the bits for the exponent and therefore cannot represent all 32-bit integer values exactly.

A 64-bit double can store any 32-bit integer value exactly.

Wikipedia has an abbreviated entry on IEEE 754 floating point, and lots of details of the internals of floating point numbers at IEEE 754-1985 — the current standard is IEEE 754:2008. It notes that a 32-bit float uses one bit for the sign, 8 bits for the exponent, leaving 23 explicit and 1 implicit bit for the mantissa, which is why absolute values up to 2²⁴ can be represented exactly.
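As a rough illustration of that layout, the three fields can be pulled out of a raw 32-bit pattern with shifts and masks. The struct and function names below are invented for this sketch; the exponent field is stored with a bias of 127.

```c
#include <stdint.h>

/* Fields of an IEEE 754 single-precision value.
 * FloatFields and split_float_bits are hypothetical names. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t mantissa;  /* 23 explicit bits; the leading 1 is implicit */
} FloatFields;

FloatFields split_float_bits(uint32_t bits)
{
    FloatFields f;
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xffu;
    f.mantissa = bits & 0x7fffffu;
    return f;
}
```

For the pattern 0x4B800000 (the value 2²⁴) this gives sign 0, exponent 151 (that is, 24 + 127) and mantissa 0.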

I thought that it was clear that a 32-bit integer can't be stored exactly in a 32-bit float. My question is: what happens IF I store an integer bigger than 2^24 or smaller than -2^24? And how can I replicate it?

Once the absolute values are larger than 2²⁴, the integer values cannot be represented exactly in the 24 effective digits of the mantissa of a 32-bit float, so only the leading 24 digits are reliably available. Floating point rounding also kicks in.

You can demonstrate with code similar to this:

#include <inttypes.h>
#include <stdio.h>

typedef union Ufloat
{
    uint32_t    i;
    float       f;
} Ufloat;

static void dump_value(uint32_t i, uint32_t v)
{
    Ufloat u = { .i = v };
    printf("0x%.8" PRIX32 ": 0x%.8" PRIX32 " = %15.7e = %15.6A\n", i, v, u.f, u.f);
}

int main(void)
{
    uint32_t lo = 1 << 23;
    uint32_t hi = 1 << 28;
    Ufloat u;

    for (uint32_t v = lo; v < hi; v <<= 1)
    {
        u.f = v;
        dump_value(v, u.i);
    }

    lo = (1 << 24) - 16;
    hi = lo + 64;

    for (uint32_t v = lo; v < hi; v++)
    {
        u.f = v;
        dump_value(v, u.i);
    }

    return 0;
}

Sample output:

0x00800000: 0x4B000000 =   8.3886080e+06 =  0X1.000000P+23
0x01000000: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x02000000: 0x4C000000 =   3.3554432e+07 =  0X1.000000P+25
0x04000000: 0x4C800000 =   6.7108864e+07 =  0X1.000000P+26
0x08000000: 0x4D000000 =   1.3421773e+08 =  0X1.000000P+27
0x00FFFFF0: 0x4B7FFFF0 =   1.6777200e+07 =  0X1.FFFFE0P+23
0x00FFFFF1: 0x4B7FFFF1 =   1.6777201e+07 =  0X1.FFFFE2P+23
0x00FFFFF2: 0x4B7FFFF2 =   1.6777202e+07 =  0X1.FFFFE4P+23
0x00FFFFF3: 0x4B7FFFF3 =   1.6777203e+07 =  0X1.FFFFE6P+23
0x00FFFFF4: 0x4B7FFFF4 =   1.6777204e+07 =  0X1.FFFFE8P+23
0x00FFFFF5: 0x4B7FFFF5 =   1.6777205e+07 =  0X1.FFFFEAP+23
0x00FFFFF6: 0x4B7FFFF6 =   1.6777206e+07 =  0X1.FFFFECP+23
0x00FFFFF7: 0x4B7FFFF7 =   1.6777207e+07 =  0X1.FFFFEEP+23
0x00FFFFF8: 0x4B7FFFF8 =   1.6777208e+07 =  0X1.FFFFF0P+23
0x00FFFFF9: 0x4B7FFFF9 =   1.6777209e+07 =  0X1.FFFFF2P+23
0x00FFFFFA: 0x4B7FFFFA =   1.6777210e+07 =  0X1.FFFFF4P+23
0x00FFFFFB: 0x4B7FFFFB =   1.6777211e+07 =  0X1.FFFFF6P+23
0x00FFFFFC: 0x4B7FFFFC =   1.6777212e+07 =  0X1.FFFFF8P+23
0x00FFFFFD: 0x4B7FFFFD =   1.6777213e+07 =  0X1.FFFFFAP+23
0x00FFFFFE: 0x4B7FFFFE =   1.6777214e+07 =  0X1.FFFFFCP+23
0x00FFFFFF: 0x4B7FFFFF =   1.6777215e+07 =  0X1.FFFFFEP+23
0x01000000: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x01000001: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x01000002: 0x4B800001 =   1.6777218e+07 =  0X1.000002P+24
0x01000003: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000004: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000005: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000006: 0x4B800003 =   1.6777222e+07 =  0X1.000006P+24
0x01000007: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x01000008: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x01000009: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x0100000A: 0x4B800005 =   1.6777226e+07 =  0X1.00000AP+24
0x0100000B: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000C: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000D: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000E: 0x4B800007 =   1.6777230e+07 =  0X1.00000EP+24
0x0100000F: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000010: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000011: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000012: 0x4B800009 =   1.6777234e+07 =  0X1.000012P+24
0x01000013: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000014: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000015: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000016: 0x4B80000B =   1.6777238e+07 =  0X1.000016P+24
0x01000017: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x01000018: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x01000019: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x0100001A: 0x4B80000D =   1.6777242e+07 =  0X1.00001AP+24
0x0100001B: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001C: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001D: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001E: 0x4B80000F =   1.6777246e+07 =  0X1.00001EP+24
0x0100001F: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000020: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000021: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000022: 0x4B800011 =   1.6777250e+07 =  0X1.000022P+24
0x01000023: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000024: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000025: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000026: 0x4B800013 =   1.6777254e+07 =  0X1.000026P+24
0x01000027: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x01000028: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x01000029: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x0100002A: 0x4B800015 =   1.6777258e+07 =  0X1.00002AP+24
0x0100002B: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002C: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002D: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002E: 0x4B800017 =   1.6777262e+07 =  0X1.00002EP+24
0x0100002F: 0x4B800018 =   1.6777264e+07 =  0X1.000030P+24

The first part of the output demonstrates that some integer values can still be stored exactly; specifically, powers of 2 can be stored exactly. In fact, more precisely (but less concisely), any integer where binary representation of the absolute value has no more than 24 significant digits (any trailing digits are zeros) can be represented exactly. The values can't necessarily be printed exactly, but that's a separate issue from storing them exactly.
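That criterion can be written down as a small check. This is a sketch under the same assumptions as the rest of the answer (32-bit int, IEEE 754 single-precision float), and exact_in_float is a name invented here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if x converts to float without rounding: once trailing zero bits
 * are stripped, the magnitude must fit in 24 bits.
 * exact_in_float is a hypothetical helper name. */
bool exact_in_float(int32_t x)
{
    /* magnitude in unsigned arithmetic so INT32_MIN is handled */
    uint32_t a = x < 0 ? 0u - (uint32_t)x : (uint32_t)x;

    while (a != 0 && (a & 1u) == 0)
        a >>= 1;                /* drop trailing zeros */
    return a < (1u << 24);      /* at most 24 significant bits remain */
}
```

For example, it accepts 16777216 (2²⁴) and 16777218 but rejects 16777217.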

The second (larger) part of the output demonstrates that up to 2²⁴-1, the integer values can be represented exactly. The value of 2²⁴ itself is also exactly representable, but 2²⁴+1 is not, so it appears the same as 2²⁴. By contrast, 2²⁴+2 can be represented with just 24 binary digits followed by 1 zero and hence can be represented exactly. Repeat ad nauseam for increments larger than 2. It looks as though round-half-to-even mode is in effect; that's why consecutive results are shared alternately by 1 input value and then by 3 input values.

(I note in passing that there isn't a way to stipulate that the double passed to printf(), converted from float by the rules of default argument promotions (ISO/IEC 9899:2011 §6.5.2.2 Function calls, ¶6), be printed as a float; the h modifier would logically be used, but is not defined.)

Source: https://stackoverflow.com/questions/25701319/what-happens-if-you-cast-a-big-int-to-float

Tags

bit-manipulation