I'm looking for an approximation of the natural exponential function operating on SSE elements, namely __m128 exp( __m128 x ). I have an implementation which is fast, but its accuracy is quite low.
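(For reference: the FastExpSse mentioned below is a Schraudolph-style approximation that builds exp(x) directly in the float's exponent bits. Reconstructed from the constants this answer quotes — a scale of (1 << 23)/ln 2 and a shift of 298765 — it would look roughly like this sketch; the exact code is in the answer referred to as "above":)

#include <emmintrin.h> // SSE2 intrinsics
#include <cmath>       // for M_LN2 (on MSVC, define _USE_MATH_DEFINES first)

__m128 FastExpSse (__m128 x)
{
    const __m128  a = _mm_set1_ps ((1 << 23) / float(M_LN2));     // scale: 2^23/ln(2)
    const __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);  // exponent bias, shifted to spread the error
    __m128i r = _mm_cvtps_epi32 (_mm_mul_ps (a, x));              // r = round(x * 2^23/ln(2))
    return _mm_castsi128_ps (_mm_add_epi32 (b, r));               // reinterpret bits as float: ~exp(x)
}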
A good increase in accuracy in my algorithm (the FastExpSse implementation in the answer above) can be obtained at the cost of an integer subtraction and a floating-point division, by using FastExpSse(x/2)/FastExpSse(-x/2) instead of FastExpSse(x). The trick here is to set the shift parameter (298765 above) to zero, so that the piecewise linear approximations in the numerator and denominator line up and give you substantial error cancellation. Roll it into a single function:
__m128 BetterFastExpSse (__m128 x)
{
    const __m128  a = _mm_set1_ps ((1 << 22) / float(M_LN2)); // scale for exp(x/2): 2^22/ln(2)
    const __m128i b = _mm_set1_epi32 (127 * (1 << 23));       // exponent bias; NB: zero shift!
    __m128i r = _mm_cvtps_epi32 (_mm_mul_ps (a, x));          // r = round(x * 2^22/ln(2))
    __m128i s = _mm_add_epi32 (b, r);                         // bit pattern of ~exp(x/2)
    __m128i t = _mm_sub_epi32 (b, r);                         // bit pattern of ~exp(-x/2)
    return _mm_div_ps (_mm_castsi128_ps (s), _mm_castsi128_ps (t)); // exp(x/2)/exp(-x/2) = exp(x)
}
(I'm not a hardware guy - how bad a performance killer is the division here?)
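A minimal harness to sanity-check the approximation against the standard library (the sample inputs are arbitrary):

#include <cstdio>

int main ()
{
    const float in[4] = { -2.0f, -0.5f, 0.5f, 2.0f };
    float out[4];
    _mm_storeu_ps (out, BetterFastExpSse (_mm_loadu_ps (in)));
    for (int i = 0; i < 4; ++i)
        printf ("x = %5.2f  approx = %9.6f  exact = %9.6f\n", in[i], out[i], std::exp (in[i]));
    return 0;
}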
If you need exp(x) just to get y = tanh(x) (e.g. for neural networks), use FastExpSse with zero shift as follows:
a = FastExpSse(x);
b = FastExpSse(-x);
y = (a - b)/(a + b);
to get the same type of error cancellation benefit. The logistic function works similarly, using FastExpSse(x/2)/(FastExpSse(x/2) + FastExpSse(-x/2)) with zero shift. (This is just to show the principle - you obviously don't want to evaluate FastExpSse multiple times here, but roll it into a single function along the lines of BetterFastExpSse above; a sketch of both follows.)
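For concreteness, here is how the two might be rolled up. The function names FastTanhSse and FastLogisticSse are mine, and this is a sketch under the same zero-shift convention rather than tuned code:

__m128 FastTanhSse (__m128 x)
{
    const __m128  a = _mm_set1_ps ((1 << 23) / float(M_LN2)); // full scale: tanh uses exp(x), exp(-x)
    const __m128i b = _mm_set1_epi32 (127 * (1 << 23));       // zero shift, as above
    __m128i r = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
    __m128 p = _mm_castsi128_ps (_mm_add_epi32 (b, r));       // ~exp(x)
    __m128 n = _mm_castsi128_ps (_mm_sub_epi32 (b, r));       // ~exp(-x)
    return _mm_div_ps (_mm_sub_ps (p, n), _mm_add_ps (p, n)); // (e^x - e^-x)/(e^x + e^-x)
}

__m128 FastLogisticSse (__m128 x)
{
    const __m128  a = _mm_set1_ps ((1 << 22) / float(M_LN2)); // half scale: logistic uses exp(+-x/2)
    const __m128i b = _mm_set1_epi32 (127 * (1 << 23));       // zero shift
    __m128i r = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
    __m128 p = _mm_castsi128_ps (_mm_add_epi32 (b, r));       // ~exp(x/2)
    __m128 n = _mm_castsi128_ps (_mm_sub_epi32 (b, r));       // ~exp(-x/2)
    return _mm_div_ps (p, _mm_add_ps (p, n));                 // e^(x/2)/(e^(x/2) + e^(-x/2))
}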
I did develop a series of higher-order approximations from this, ever more accurate but also slower. Unpublished but happy to collaborate if anyone wants to give them a spin.
And finally, for some fun: use the trick in reverse gear to get FastLogSse. Chaining that with FastExpSse gives you both operator cancellation and error cancellation, and out pops a blazingly fast power function...
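To make the reverse-gear idea concrete, here is a sketch (the chained version FastPowSse is my name; it uses the FastExpSse sketch from the top of this answer):

__m128 FastLogSse (__m128 x)
{
    const __m128  a = _mm_set1_ps (float(M_LN2) / (1 << 23));    // inverse scale: ln(2)/2^23
    const __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765); // same bias/shift as FastExpSse,
                                                                 // so the errors cancel when chained
    __m128i r = _mm_sub_epi32 (_mm_castps_si128 (x), b);         // bits(x) - bias ~ 2^23 * log2(x)
    return _mm_mul_ps (a, _mm_cvtepi32_ps (r));                  // ~ln(x)
}

// pow(x, y) = exp(y * ln(x)), with the log/exp approximation errors largely cancelling
__m128 FastPowSse (__m128 x, __m128 y)
{
    return FastExpSse (_mm_mul_ps (y, FastLogSse (x)));
}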