I have been googling this question for the past hour, but all I find are pointers to Taylor series or sample code that is either too slow or does not compile at all. Well, most answe
Based on the idea from http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648, plus some manual rewriting to improve performance in a micro-benchmark, I ended up with the following cosine implementation. It is used in an HPC physics simulation that is bottlenecked by repeated cos calls over a large range of values. It is accurate enough for that purpose and much faster than a lookup table. Most notably, no division is required at run time, since the 1/(2π) factor is a compile-time constant.
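#include <cmath>  // std::floor, std::abs; M_PI is a POSIX extension (MSVC needs _USE_MATH_DEFINES)

template<typename T>
inline T cos(T x) noexcept
{
    constexpr T tp = 1./(2.*M_PI);             // 1/(2*pi), evaluated at compile time
    x *= tp;                                    // express the angle in turns
    x -= T(.25) + std::floor(x + T(.25));       // shift by a quarter turn and wrap into [-0.5, 0.5)
    x *= T(16.) * (std::abs(x) - T(.5));        // parabolic approximation of the cosine
#if EXTRA_PRECISION
    x += T(.225) * x * (std::abs(x) - T(1.));   // correction term that reduces the error
#endif
    return x;
}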
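One way to read the arithmetic in the function above is as a range reduction followed by a parabola fitted to a shifted sine. The symbols u, k, t, y and y' below are introduced here only for illustration and do not appear in the code. This is a sketch of the reasoning, not a formal error analysis.

```latex
\begin{align*}
u &= \frac{x}{2\pi}, \qquad
k = \left\lfloor u + \tfrac{1}{4} \right\rfloor, \qquad
t = u - \tfrac{1}{4} - k \in \left[-\tfrac{1}{2}, \tfrac{1}{2}\right) \\
\cos x &= \cos\!\left(2\pi t + \tfrac{\pi}{2} + 2\pi k\right) = -\sin(2\pi t) \\
-\sin(2\pi t) &\approx y = 16\, t \left(|t| - \tfrac{1}{2}\right) \\
y' &= y + 0.225\, y \left(|y| - 1\right) \qquad \text{(only with EXTRA\_PRECISION)}
\end{align*}
```

The parabola y agrees with -sin(2πt) exactly at t = 0, ±1/4 and ±1/2, which is why the result is exact at x = 0, ±π/2 and ±π and slightly off in between.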
The Intel compiler, at least, is also smart enough to vectorize this function when it is used in a loop.
If EXTRA_PRECISION is defined to a nonzero value, the maximum absolute error is about 0.00109 over the range -π to π, assuming T is double (a 64-bit IEEE 754 type on most C++ implementations). Otherwise, the maximum absolute error is about 0.056 over the same range.
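The error figures can be checked with a small harness along the following lines. This is a minimal sketch, not part of the original code. The names fast_cos and max_abs_error are hypothetical, the EXTRA_PRECISION macro is replaced by a template flag so that both variants can live in one program, and a literal value of pi is used instead of the non-standard M_PI. The error is only sampled at evenly spaced points, so the result is an estimate of the true maximum.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical name. Same arithmetic as the cos template above, fixed to
// double, with the EXTRA_PRECISION macro turned into a template flag.
template <bool ExtraPrecision>
inline double fast_cos(double x) noexcept
{
    constexpr double pi = 3.14159265358979323846; // literal instead of M_PI
    x *= 1.0 / (2.0 * pi);                        // angle in turns
    x -= 0.25 + std::floor(x + 0.25);             // wrap into [-0.5, 0.5)
    x *= 16.0 * (std::abs(x) - 0.5);              // parabolic approximation
    if (ExtraPrecision)
        x += 0.225 * x * (std::abs(x) - 1.0);     // correction term
    return x;
}

// Hypothetical helper. Returns the largest |fast_cos(x) - std::cos(x)| seen
// at `samples` evenly spaced points in [-pi, pi]; `samples` must be >= 2.
template <bool ExtraPrecision>
double max_abs_error(std::size_t samples)
{
    constexpr double pi = 3.14159265358979323846;
    double worst = 0.0;
    for (std::size_t i = 0; i < samples; ++i) {
        const double x = -pi + 2.0 * pi * static_cast<double>(i)
                                   / static_cast<double>(samples - 1);
        const double err = std::abs(fast_cos<ExtraPrecision>(x) - std::cos(x));
        worst = std::max(worst, err);
    }
    return worst;
}
```

Calling max_abs_error<false>(100001) and max_abs_error<true>(100001) should give values close to the two figures quoted above, provided the sampling is dense enough.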