I\'m looking for an approximation of the natural exponential function operating on SSE element. Namely - __m128 exp( __m128 x ).
I have an implementat
Going back through my notes from way back then, I did explore ways to improve the accuracy without using division. I used the same reinterpret-as-float trick but applied a polynomial correction to the mantissa which was essentially calculated in 16-bit fixed-point arithmetic (the only way to do it fast back then).
The cubic resp. quartic versions give you 4 resp. 5 significant digits of accuracy. There was no point increasing the order beyond that, as the noise of the low-precision arithmetic then starts to drown out the error of the polynomial approximation. Here are the plain C versions:
#include
float fastExp3(register float x) // cubic spline approximation
{
union { float f; int32_t i; } reinterpreter;
reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
int32_t m = (reinterpreter.i >> 7) & 0xFFFF; // copy mantissa
// empirical values for small maximum relative error (8.34e-5):
reinterpreter.i +=
((((((((1277*m) >> 14) + 14825)*m) >> 14) - 79749)*m) >> 11) - 626;
return reinterpreter.f;
}
float fastExp4(register float x) // quartic spline approximation
{
union { float f; int32_t i; } reinterpreter;
reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
int32_t m = (reinterpreter.i >> 7) & 0xFFFF; // copy mantissa
// empirical values for small maximum relative error (1.21e-5):
reinterpreter.i += (((((((((((3537*m) >> 16)
+ 13668)*m) >> 18) + 15817)*m) >> 14) - 80470)*m) >> 11);
return reinterpreter.f;
}
The quartic one obeys (fastExp4(0f) == 1f), which can be important for fixed-point iteration algorithms.
How efficient are these integer multiply-shift-add sequences in SSE? On architectures where float arithmetic is just as fast, one could use that instead, reducing the arithmetic noise. This would essentially yield cubic and quartic extensions of @njuffa's answer above.