A fast method to round a double to a 32-bit int explained

谎友^ 2020-11-28 17:18

When reading Lua's source code, I noticed that Lua uses a macro to round a double to a 32-bit int. I extracted the macro

3 answers
  • 2020-11-28 17:59

    This kind of "trick" comes from older x86 processors that used the 8087 instructions/interface for floating point. On these machines there is an instruction, "fist", for converting floating point to integer, but it uses the current fp rounding mode. Unfortunately, the C spec requires that fp->int conversions truncate towards zero, while all other fp operations round to nearest, so doing an fp->int conversion requires first changing the fp rounding mode, then doing a fist, then restoring the fp rounding mode.

    Now on the original 8086/8087, this wasn't too bad, but on later processors that started to get super-scalar and out-of-order execution, altering the fp rounding mode generally serializes the CPU core and is quite expensive. So on a CPU like a Pentium-III or Pentium-IV, the overall cost is quite high: a normal fp->int conversion is 10x or more expensive than this add+store+load trick.

    On x86-64, however, floating point is done with the xmm instructions, and the cost of converting fp->int is pretty small, so this "optimization" is likely slower than a normal conversion.
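
    To make the difference between the two conversions concrete, here is a minimal sketch. The function names are made up for illustration; it assumes the default rounding mode (round to nearest, ties to even) has not been changed.

    ```cpp
    #include <cmath>

    // A plain cast truncates toward zero, as the C and C++ standards require.
    inline int truncate_to_int(double d) {
        return static_cast<int>(d);
    }

    // std::lrint converts using the current rounding mode instead,
    // which is round-to-nearest, ties-to-even unless the program changed it.
    inline long round_current_mode(double d) {
        return std::lrint(d);
    }
    ```

    For example, truncate_to_int(2.7) gives 2 while round_current_mode(2.7) gives 3. The add-a-magic-number trick produces the second kind of result, not the first.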

  • 2020-11-28 18:06

    Here is a simpler implementation of the above Lua trick:

    /**
     * Round to the nearest integer.
     * for tie-breaks: round half to even (bankers' rounding)
     * Only works for inputs in the range: [-2^51, 2^51]
     */
    inline double rint(double d)
    {
        double x = 6755399441055744.0;  // 2^51 + 2^52
        return d + x - x;
    }
    

    The trick works for numbers with absolute value < 2 ^ 51.

    This is a little program to test it: ideone.com

    #include <cstdio>

    // The rint() definition from the snippet above must appear here, before main().
    
    int main()
    {
        // round to nearest integer
        printf("%.1f, %.1f\n", rint(-12345678.3), rint(-12345678.9));
    
        // test tie-breaking rule
        printf("%.1f, %.1f, %.1f, %.1f\n", rint(-24.5), rint(-23.5), rint(23.5), rint(24.5));      
        return 0;
    }
    
    // output:
    // -12345678.0, -12345679.0
    // -24.0, -24.0, 24.0, 24.0
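
    The range limit can be checked directly. The sketch below repeats the same add-and-subtract rounding under a hypothetical name, round_magic, so it does not clash with the standard rint. It assumes IEEE 754 doubles, the default rounding mode, arithmetic evaluated in plain double precision (as on x86-64), and no -ffast-math style reassociation.

    ```cpp
    // Same add-then-subtract rounding as above, under a made-up name.
    inline double round_magic(double d) {
        const double x = 6755399441055744.0;  // 2^52 + 2^51
        return d + x - x;
    }

    // 2^51 + 1 is an integer just outside the supported range.
    // d + x would be 2^53 + 1, which is not representable (doubles are
    // 2 apart there), so the sum is rounded and the input is not recovered.
    inline bool survives_round_trip(double d) {
        return round_magic(d) == d;
    }
    ```

    Under those assumptions, survives_round_trip(1234.0) is true, while survives_round_trip(2251799813685249.0), i.e. 2^51 + 1, is false.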
    
  • A double is represented like this:

    [Figure: double representation]

    It can be seen as two 32-bit integers. The int taken in all the versions of your code (supposing it's a 32-bit int) is the one on the right in the figure, so what you are doing in the end is just taking the lowest 32 bits of the mantissa.


    Now, to the magic number; as you correctly stated, 6755399441055744 is 2^51 + 2^52; adding such a number forces the double to go into the "sweet range" between 2^52 and 2^53, which, as explained by Wikipedia here, has an interesting property:

    Between 2^52 = 4,503,599,627,370,496 and 2^53 = 9,007,199,254,740,992 the representable numbers are exactly the integers

    This follows from the fact that the mantissa is 52 bits wide.
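
    This spacing can be observed with std::nextafter. The helper below is a small sketch with a made-up name, assuming IEEE 754 doubles:

    ```cpp
    #include <cmath>
    #include <limits>

    // Distance from d to the next representable double above it.
    inline double gap_above(double d) {
        return std::nextafter(d, std::numeric_limits<double>::infinity()) - d;
    }
    ```

    With it, gap_above(4503599627370496.0) (2^52) and gap_above(6755399441055744.0) (the magic number) are both 1.0, while gap_above(2251799813685248.0) (2^51) is 0.5 and gap_above(9007199254740992.0) (2^53) is 2.0.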

    The other interesting fact about adding 2^51 + 2^52 is that it affects the mantissa only in the two highest bits - which are discarded anyway, since we are taking only its lowest 32 bits.


    Last but not least: the sign.

    IEEE 754 floating point uses a magnitude and sign representation, while integers on "normal" machines use 2's complement arithmetic; how is this handled here?

    We talked only about positive integers; now suppose we are dealing with a negative number in the range representable by a 32-bit int, so of magnitude at most 2^31; call it -a. Such a number is obviously made positive by adding the magic number, and the resulting value is 2^52 + 2^51 + (-a).

    Now, what do we get if we interpret the mantissa in 2's complement representation? It must be the result of the 2's complement sum of (2^52 + 2^51) and (-a). Again, the first term affects only the upper two bits; what remains in bits 0~50 is the 2's complement representation of (-a) (again, minus the upper two bits).

    Since reduction of a 2's complement number to a smaller width is done just by cutting away the extra bits on the left, taking the lower 32 bits gives us correctly (-a) in 32 bit, 2's complement arithmetic.
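
    Putting the pieces together, the whole conversion can be sketched as follows. This is not Lua's actual macro, only an illustration of the steps described above; the name double_to_int32 is made up. It assumes IEEE 754 doubles, the default rounding mode, an input that rounds into the 32-bit int range, and the usual wrap-around when converting uint32_t to int32_t.

    ```cpp
    #include <cstdint>
    #include <cstring>

    inline std::int32_t double_to_int32(double d) {
        d += 6755399441055744.0;              // 2^52 + 2^51: rounds d to an integer
        std::uint64_t bits;
        std::memcpy(&bits, &d, sizeof bits);  // copy out the raw 64-bit pattern
        // Keep only the lowest 32 bits of the mantissa and read them
        // as a 2's complement value.
        return static_cast<std::int32_t>(static_cast<std::uint32_t>(bits));
    }
    ```

    Going through a 64-bit integer and truncating, rather than reading one half of the double as an int, avoids having to know which half comes first in memory. For example, double_to_int32(3.7) gives 4, double_to_int32(-3.7) gives -4, and double_to_int32(2.5) gives 2 because ties go to the even neighbour.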
