Fastest Implementation of the Natural Exponential Function Using SSE

后悔当初 2020-11-28 10:17

I'm looking for an approximation of the natural exponential function that operates on SSE elements, namely __m128 exp( __m128 x ).

I have an implementat

4 Answers
  •  南笙 (original poster)
    2020-11-28 10:36

    Going back through my notes from way back then, I did explore ways to improve the accuracy without using division. I used the same reinterpret-as-float trick but applied a polynomial correction to the mantissa which was essentially calculated in 16-bit fixed-point arithmetic (the only way to do it fast back then).

    The cubic and quartic versions give you 4 and 5 significant digits of accuracy, respectively. There was no point in increasing the order beyond that, as the noise of the low-precision arithmetic then starts to drown out the error of the polynomial approximation. Here are the plain C versions:

    #include <stdint.h>
    
    float fastExp3(register float x)  // cubic spline approximation
    {
        union { float f; int32_t i; } reinterpreter;
    
        reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
        int32_t m = (reinterpreter.i >> 7) & 0xFFFF;  // copy mantissa
        // empirical values for small maximum relative error (8.34e-5):
        reinterpreter.i +=
             ((((((((1277*m) >> 14) + 14825)*m) >> 14) - 79749)*m) >> 11) - 626;
        return reinterpreter.f;
    }
    
    float fastExp4(register float x)  // quartic spline approximation
    {
        union { float f; int32_t i; } reinterpreter;
    
        reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
        int32_t m = (reinterpreter.i >> 7) & 0xFFFF;  // copy mantissa
        // empirical values for small maximum relative error (1.21e-5):
        reinterpreter.i += (((((((((((3537*m) >> 16)
            + 13668)*m) >> 18) + 15817)*m) >> 14) - 80470)*m) >> 11);
        return reinterpreter.f;
    }
    

    The quartic one obeys fastExp4(0.0f) == 1.0f, which can be important for fixed-point iteration algorithms.
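
    To see why this identity holds for the quartic version but not for the cubic one, it helps to isolate the two correction terms. For x = 0 the integer pattern is exactly 127*(1 << 23), so m = 0, and the result is 1.0f only if the correction evaluates to zero there. The sketch below copies the two expressions from the functions above into helpers; the names cubicCorrection and quarticCorrection are made up for this illustration.

```c
#include <stdint.h>

/* Correction that fastExp3 adds to the integer bit pattern,
   as a function of the 16-bit mantissa copy m (0 <= m <= 0xFFFF).
   Illustrative helper, not part of the original code. */
static int32_t cubicCorrection(int32_t m)
{
    return ((((((((1277*m) >> 14) + 14825)*m) >> 14) - 79749)*m) >> 11) - 626;
}

/* Same for fastExp4: every term carries a factor of m,
   so the correction vanishes at m == 0. */
static int32_t quarticCorrection(int32_t m)
{
    return (((((((((((3537*m) >> 16)
        + 13668)*m) >> 18) + 15817)*m) >> 14) - 80470)*m) >> 11);
}
```

    With m = 0, cubicCorrection returns the constant offset -626, so fastExp3(0.0f) lands 626 bit-pattern steps below 1.0f, whereas quarticCorrection returns 0 and leaves the pattern of 1.0f untouched.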

    How efficient are these integer multiply-shift-add sequences in SSE? On architectures where float arithmetic is just as fast, one could use that instead, reducing the arithmetic noise. This would essentially yield cubic and quartic extensions of @njuffa's answer above.
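
    As a rough sketch of the float-arithmetic idea, the quartic correction can be rewritten as a polynomial in t = m/65536. Expanding the shifts gives c = 32*t*(-80470 + 63268*t + 13668*t^2 + 3537*t^3) in units of the mantissa LSB. The function below evaluates that with SSE2 float operations. The name fastExp4_ps is hypothetical, the coefficients are just the expanded fixed-point ones (not re-optimized), the final conversion rounds to nearest instead of truncating like the shifts do, and there is no handling of overflow or underflow. Its accuracy and speed have not been measured here.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical float-arithmetic variant of fastExp4 for four floats at once.
   Assumes exp(x) stays a normal float (roughly -87 < x < 88). */
static __m128 fastExp4_ps(__m128 x)
{
    const __m128i one_bits  = _mm_set1_epi32(127 << 23);  /* bit pattern of 1.0f */
    const __m128i mant_mask = _mm_set1_epi32(0x007FFFFF); /* 23 mantissa bits */

    /* i = (int32_t)(12102203.0f*x) + 127*(1 << 23), truncating like the C cast */
    __m128i i = _mm_add_epi32(
        _mm_cvttps_epi32(_mm_mul_ps(_mm_set1_ps(12102203.0f), x)), one_bits);

    /* t = mantissa fraction in [0, 1) */
    __m128 t = _mm_mul_ps(_mm_cvtepi32_ps(_mm_and_si128(i, mant_mask)),
                          _mm_set1_ps(1.0f / 8388608.0f));

    /* Horner evaluation of c = t*(c1 + t*(c2 + t*(c3 + t*c4))),
       with coefficients expanded from the fixed-point expression */
    __m128 c = _mm_set1_ps(113184.0f);                          /* 32*3537   */
    c = _mm_add_ps(_mm_mul_ps(c, t), _mm_set1_ps(437376.0f));   /* 32*13668  */
    c = _mm_add_ps(_mm_mul_ps(c, t), _mm_set1_ps(2024576.0f));  /* 32*63268  */
    c = _mm_add_ps(_mm_mul_ps(c, t), _mm_set1_ps(-2575040.0f)); /* -32*80470 */
    c = _mm_mul_ps(c, t);

    /* add the correction to the bit pattern and reinterpret as float */
    i = _mm_add_epi32(i, _mm_cvtps_epi32(c));
    return _mm_castsi128_ps(i);
}
```

    Since t = 0 gives c = 0, this variant also returns exactly 1.0f for x = 0. Whether it beats the integer multiply-shift-add sequence depends on the target CPU; the 32-bit integer multiply _mm_mullo_epi32 needs SSE4.1, while everything used here is available in SSE2.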
