Cast Integer to Float using Bit Manipulation breaks on some integers in C

Submitted by 倖福魔咒の on 2019-12-11 08:00:59

Question


Working on a class assignment, I'm trying to convert an integer to a float using only bit manipulation (limited to integer/unsigned operations, including ||, &&, if, and while). My code works for most values, but for some it does not produce the result I expect.

For example, if x is 0x807fffff, I get 0xceff0001, but the correct result should be 0xceff0000. I think I'm missing something in how I handle the mantissa and rounding, but I can't quite pin it down. I've also looked at some other threads on SO: converting-int-to-float and how-to-manually.

unsigned dl22(int x) {


    int tmin = 0x1 << 31;
    int tmax = ~tmin;

    unsigned signBit = 0;
    unsigned exponent;
    unsigned mantissa;
    int bias = 127;

    if (x == 0) {
        return 0;
    }

    if (x == tmin) {
        return 0xcf << 24;
    }

    if (x < 0) {
        signBit = x & tmin;
        x = (~x + 1);
    }


    exponent = bias + 31;

    while ( ( x & tmin) == 0 ) {
        exponent--;
        x <<= 1;
    }

    exponent <<= 23;
    int mantissaMask = ~(tmin >> 8);
    mantissa = (x >> 8) & mantissaMask;

    return (signBit | exponent | mantissa);
}

EDIT/UPDATE Found a viable solution - see below


Answer 1:


Your code produces the expected output for me on the example you presented. As discussed in comments, however, from C's perspective it does exhibit undefined behavior -- not just in the computation of tmin, but also, for the same reason, in the loop wherein you compute the exponent. To whatever extent this code produces results that vary from environment to environment, that will follow either from the undefined behavior or from your assumption about the size of [unsigned] int being incorrect for the C implementation in use.

Nevertheless, if we assume (unsafely)

  1. that shifts of ints operate as if the left operand were reinterpreted as an unsigned int with the same bit pattern, operated upon, and the resulting bit pattern reinterpreted as an int, and
  2. that int and unsigned int are at least 32 bits wide,

then your code seems correct, modulo rounding.
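One way to avoid relying on those two assumptions is to do all of the bit work on an unsigned value of known width. The following is only a minimal sketch: the name int_to_float_bits_trunc is hypothetical, it assumes the fixed-width types from <stdint.h> are available, and it keeps the truncating behavior of the code in the question.

```c
#include <stdint.h>

/* Hypothetical helper: same algorithm as dl22, but every shift is done on an
 * unsigned 32-bit value, so no signed overflow or signed shift is involved.
 * Extra low bits are truncated (round toward zero), as in the question. */
uint32_t int_to_float_bits_trunc(int32_t x)
{
    if (x == 0) {
        return 0;
    }

    uint32_t sign = 0;
    uint32_t mag = (uint32_t)x;        /* int -> unsigned conversion is well defined */
    if (x < 0) {
        sign = UINT32_C(0x80000000);
        mag = 0u - mag;                /* unsigned negation wraps; INT32_MIN gives 2^31 */
    }

    uint32_t exponent = 127 + 31;      /* bias + position of the top bit */
    while ((mag & UINT32_C(0x80000000)) == 0) {
        exponent--;
        mag <<= 1;                     /* unsigned shift: no undefined behavior */
    }

    /* Drop the implicit leading 1 and shift off the 8 low bits. */
    uint32_t mantissa = (mag >> 8) & UINT32_C(0x007FFFFF);
    return sign | (exponent << 23) | mantissa;
}
```

Because the magnitude is computed with unsigned wraparound, the most negative int needs no special case in this sketch.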

In the event that the absolute value of the input int has more than 24 significant binary digits (i.e. it is at least 2^24), however, some precision will be lost in the conversion. In that case the correct result will depend on the FP rounding mode you intend to implement. An incorrectly rounded result will be off by 1 unit in the last place; how many results that affects depends on the rounding mode.

Simply truncating / shifting off the extra bits as you do yields round toward zero mode. That's one of the standard rounding modes, but not the default. The default rounding mode is to round to the nearest representable number, with ties being resolved in favor of the result having least-significant bit 0 (round to even); there are also three other standard modes. To implement any mode other than round-toward-zero, you'll need to capture the 8 least-significant bits of the significand after scaling and before shifting them off. These, together with other details depending on the chosen rounding mode, will determine how to apply the correct rounding.
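To illustrate how the 8 discarded bits drive the rounding decision, here is a hedged sketch of a helper that reports whether the kept 24-bit significand should be incremented. The enum, its constants, and the function name round_increment are all hypothetical, and only four rounding directions are covered (round-to-nearest with ties away from zero is left out).

```c
#include <stdint.h>

/* Hypothetical names; none of these appear in the original post. */
enum round_mode { TOWARD_ZERO, TO_NEAREST_EVEN, TOWARD_POS_INF, TOWARD_NEG_INF };

/* norm:     magnitude already shifted so that bit 31 is set.
 * negative: nonzero if the original integer was negative.
 * Returns 1 if the 24-bit significand (norm >> 8) must be incremented. */
unsigned round_increment(uint32_t norm, int negative, enum round_mode mode)
{
    uint32_t discarded = norm & 0xFFu;     /* the 8 bits about to be shifted off */
    uint32_t lsb = (norm >> 8) & 1u;       /* lowest bit that is kept */

    if (discarded == 0) {
        return 0;                          /* exact: nothing to round */
    }
    switch (mode) {
    case TO_NEAREST_EVEN:
        if (discarded > 0x80u) return 1;   /* more than halfway: round up */
        if (discarded < 0x80u) return 0;   /* less than halfway: round down */
        return lsb;                        /* tie: make the kept lsb 0 */
    case TOWARD_POS_INF:
        return !negative;                  /* grow the magnitude only for positives */
    case TOWARD_NEG_INF:
        return negative != 0;              /* grow the magnitude only for negatives */
    case TOWARD_ZERO:
    default:
        return 0;                          /* truncate */
    }
}
```

Adding the returned increment to the fully assembled word (sign, exponent, and mantissa together) lets a carry out of an all-ones mantissa propagate into the exponent field, which is the correct result for that case.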

About half of the 32-bit two's complement numbers will be rounded differently when converted in round-to-zero mode than when converted in any one of the other modes; which numbers exhibit a discrepancy depends on which rounding mode you consider.




Answer 2:


I didn't originally mention that I am trying to imitate a U2F union statement:

float u2f(unsigned u) {
  union {
    unsigned u;
    float f;
  } a;
  a.u = u;
  return a.f;
}
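For checking results against the compiler's own int-to-float conversion, it can help to move bit patterns in both directions. The sketch below uses memcpy rather than a union; the names bits_to_float and float_to_bits are hypothetical, and it assumes float is a 32-bit IEEE 754 single on the target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, assuming float is a 32-bit IEEE 754 single. */
float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);   /* copy the bit pattern unchanged */
    return f;
}

uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* copy the bit pattern unchanged */
    return u;
}
```

With these, a hand-built result can be compared to float_to_bits((float)x) for any int x, under the default round-to-nearest mode.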

Thanks to the guidance in the post ieee-754-bit-manipulation-rounding-error, I was able to fix the rounding issues by putting the following after my while statement. It made clear what rounding was actually occurring.

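/* Inspect the bits that the final >> 8 discards. */
unsigned lsb = (x >> 8) & 1;            /* lowest bit that is kept */
unsigned roundBit = (x >> 7) & 1;       /* highest discarded bit (half a unit in the last place) */
unsigned stickyBitFlag = !!(x & 0x7F);  /* 1 if any lower discarded bit is set */

exponent <<= 23;

int mantissaMask = ~(tmin >> 8);
mantissa = (x >> 8);
mantissa &= mantissaMask;

/* Round up when above the halfway point, or exactly halfway with an odd lsb. */
roundBit = (roundBit & stickyBitFlag) | (roundBit & lsb);

return (signBit | exponent | mantissa) + roundBit;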
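Putting the pieces together, the following is a minimal self-contained sketch of the whole conversion with round-to-nearest, ties-to-even. It is not the asker's final code: the name int_to_float_bits is hypothetical, and the magnitude is held in an unsigned 32-bit variable (an assumption made here to keep the shifts well defined).

```c
#include <stdint.h>

/* Hypothetical name. Returns the IEEE 754 single-precision bit pattern of x,
 * rounding to nearest with ties to even. */
uint32_t int_to_float_bits(int32_t x)
{
    if (x == 0) {
        return 0;
    }

    uint32_t sign = (x < 0) ? UINT32_C(0x80000000) : 0;
    uint32_t mag = (uint32_t)x;
    if (x < 0) {
        mag = 0u - mag;                          /* magnitude via unsigned wraparound */
    }

    uint32_t exponent = 127 + 31;
    while ((mag & UINT32_C(0x80000000)) == 0) {  /* normalize: top bit into bit 31 */
        exponent--;
        mag <<= 1;
    }

    uint32_t lsb = (mag >> 8) & 1u;              /* lowest kept bit */
    uint32_t round_bit = (mag >> 7) & 1u;        /* highest discarded bit */
    uint32_t sticky = (mag & 0x7Fu) != 0;        /* any lower discarded bit set */
    uint32_t round_up = round_bit & (sticky | lsb);

    uint32_t mantissa = (mag >> 8) & UINT32_C(0x007FFFFF);

    /* Adding after assembly lets a mantissa carry bump the exponent. */
    return (sign | (exponent << 23) | mantissa) + round_up;
}
```

For the value from the question, x = 0x807fffff (that is, -2139095041), the discarded bits are 0x02, so nothing is rounded up and the result is 0xceff0000.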


Source: https://stackoverflow.com/questions/42013577/cast-integer-to-float-using-bit-manipulation-breaks-on-some-integers-in-c
