Fastest implementation of sine, cosine and square root in C++ (doesn't need to be very accurate)

Submitted by 那年仲夏 on 2019-11-27 05:12:45

Question


I have been googling this question for the past hour, but all I find are pointers to the Taylor series or sample code that is either too slow or does not compile at all. Most of the answers I've found via Google say "Google it, it's already been asked", but sadly it hasn't been...

I am profiling my game on a low-end Pentium 4 and found that ~85% of execution time is spent calculating sine, cosine and square root (from the standard C++ library in Visual Studio), and this seems to be heavily CPU dependent (on my i7 the same functions take only 5% of execution time, and the game is waaaaaaaaaay faster). I cannot optimize these three functions away, nor calculate both sine and cosine in one pass (they're interdependent), but I don't need very accurate results for my simulation, so I can live with a faster approximation.

So, the question: what is the fastest way to calculate sine, cosine and square root for float in C++?

EDIT: Lookup tables are more painful, as the resulting cache misses are far more costly on a modern CPU than a Taylor series. CPUs are just so fast these days, and caches are not.

I made a mistake: I thought I needed to calculate several factorials for the Taylor series, but I see now that they can be implemented as constants.

So the updated question: is there any speedy optimization for the square root as well?

EDIT2

I am using the square root to calculate distances, not for normalization, so I can't use the fast inverse square root algorithm (as pointed out in a comment: http://en.wikipedia.org/wiki/Fast_inverse_square_root).

EDIT3

I also cannot operate on squared distances; I need the exact distance for the calculations.


Answer 1:


The fastest way is to pre-compute values and use a table like in this example:

Create sine lookup table in C++

BUT if you insist upon computing at runtime you can use the Taylor series expansion of sine or cosine...

For more on the Taylor series... http://en.wikipedia.org/wiki/Taylor_series

One of the keys to getting this to work well is pre-computing the factorials and truncating at a sensible number of terms. The factorials grow in the denominator very quickly, so you don't need to carry more than a few terms.

Also, don't recompute x^n from scratch each time. For example, compute x^3 once, then multiply by x^2 to get x^5, by x^2 again to get x^7, and so on.
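
Here is one way to put those two tips together in code, a minimal sketch of my own (the term count and the names are arbitrary choices, not anything from the linked answer):

inline float taylor_sin(float x)
{
    // Reciprocals of 3!, 5!, 7!, 9!, 11!, 13! precomputed as constants.
    constexpr float r3  = 1.0f / 6.0f;
    constexpr float r5  = 1.0f / 120.0f;
    constexpr float r7  = 1.0f / 5040.0f;
    constexpr float r9  = 1.0f / 362880.0f;
    constexpr float r11 = 1.0f / 39916800.0f;
    constexpr float r13 = 1.0f / 6227020800.0f;

    const float x2 = x * x;
    float p = x;                 // current odd power of x
    float s = x;                 // running sum
    p *= x2; s -= p * r3;        // - x^3/3!
    p *= x2; s += p * r5;        // + x^5/5!
    p *= x2; s -= p * r7;        // - x^7/7!
    p *= x2; s += p * r9;        // + x^9/9!
    p *= x2; s -= p * r11;       // - x^11/11!
    p *= x2; s += p * r13;       // + x^13/13!
    return s;
}

Accuracy is only reasonable for x roughly within [-pi, pi], so range-reduce the argument first.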




Answer 2:


First, the Taylor series is NOT the best/fastest way to implement sine/cosine. It is also not the way professional libraries implement these trigonometric functions, and knowing the best numerical implementation allows you to tweak the accuracy to gain speed more efficiently. In addition, this problem has already been extensively discussed on Stack Overflow. Here is just one example.

Second, the big difference you see between old and new PCs is due to the fact that modern Intel architectures have dedicated assembly instructions to calculate elementary trigonometric functions. It is quite hard to beat them on execution speed.

Finally, let's talk about the code on your old PC. Check the GSL (GNU Scientific Library) implementation (or Numerical Recipes), and you will see that they basically use the Chebyshev approximation formula.

Chebyshev approximation converges faster, so you need to evaluate fewer terms. I won't write implementation details here because there are already very nice answers posted on Stack Overflow. Check this one, for example. Just tweak the number of terms in the series to change the balance between accuracy and speed.

For this kind of problem, if you want implementation details of some special function or numerical method, you should take a look at the GSL code before doing anything else - GSL is THE standard numerical library.
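
To see what that looks like in practice, here is a minimal sketch of my own (not GSL's internals) that uses GSL's Chebyshev-series routines from <gsl/gsl_chebyshev.h> to fit sin on a range; the order 8 and the [-pi, pi] interval are arbitrary knobs:

#include <cmath>
#include <gsl/gsl_math.h>
#include <gsl/gsl_chebyshev.h>

// Sketch: fit sin(x) on [-pi, pi] with a low-order Chebyshev series once,
// then evaluate the polynomial instead of calling std::sin.
static double plain_sin(double x, void *) { return std::sin(x); }

int main()
{
    gsl_cheb_series *cs = gsl_cheb_alloc(8);    // polynomial order: the accuracy knob
    gsl_function F;
    F.function = &plain_sin;
    F.params   = nullptr;
    gsl_cheb_init(cs, &F, -M_PI, M_PI);

    double approx = gsl_cheb_eval(cs, 1.0);     // ~sin(1.0)
    double exact  = std::sin(1.0);
    // compare approx vs. exact to judge whether order 8 is enough for you
    (void)approx; (void)exact;
    gsl_cheb_free(cs);
    return 0;
}

Typically you link this with something like -lgsl -lgslcblas -lm.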

EDIT: you may improve the execution time by enabling aggressive floating-point optimization flags in gcc/icc. This will decrease the precision, but it seems that's exactly what you want.

EDIT2: You can try to make a coarse sine grid and use the GSL routine gsl_interp_cspline_periodic (a spline with periodic boundary conditions) to interpolate that table (the spline reduces the error compared to linear interpolation => you need fewer points in your table => fewer cache misses)!
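
A rough sketch of that spline-table idea (my own illustration; the table size of 64 and the wrapper name are arbitrary):

#include <cmath>
#include <cstddef>
#include <vector>
#include <gsl/gsl_spline.h>

// Sketch only: build a coarse periodic sine table once, then evaluate the
// cubic spline instead of std::sin. Needs M_PI (MSVC: _USE_MATH_DEFINES).
struct SplineSin {
    gsl_interp_accel *acc;
    gsl_spline *spline;

    explicit SplineSin(std::size_t n = 64) {
        std::vector<double> xs(n + 1), ys(n + 1);
        for (std::size_t i = 0; i <= n; ++i) {
            xs[i] = 2.0 * M_PI * i / n;
            ys[i] = std::sin(xs[i]);
        }
        ys[n] = ys[0];  // the periodic spline requires first == last value
        acc = gsl_interp_accel_alloc();
        spline = gsl_spline_alloc(gsl_interp_cspline_periodic, n + 1);
        gsl_spline_init(spline, xs.data(), ys.data(), n + 1);
    }
    ~SplineSin() {
        gsl_spline_free(spline);
        gsl_interp_accel_free(acc);
    }
    double operator()(double x) const {
        x = std::fmod(x, 2.0 * M_PI);
        if (x < 0.0) x += 2.0 * M_PI;   // wrap into [0, 2*pi)
        return gsl_spline_eval(spline, x, acc);
    }
};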




Answer 3:


Here's the guaranteed fastest possible sine function in C++:

double FastSin(double x)
{
    return 0;
}

Oh, you wanted better accuracy than |1.0|? Well, here is a sine function that is similarly fast:

double FastSin(double x)
{
    return x;
}

This answer actually does not suck, when x is close to zero. For small x, sin(x) is approximately equal to x.

What, still not accurate enough for you? Well read on.

Engineers in the 1970s made some fantastic discoveries in this field, but new programmers are simply unaware that these methods exist, because they're not taught as part of standard computer science curricula.

You need to start by understanding that there is no "perfect" implementation of these functions for all applications. Therefore, superficial answers to questions like "which one is fastest" are guaranteed to be wrong.

Most people who ask this question don't understand the importance of the tradeoffs between performance and accuracy. In particular, you are going to have to make some choices regarding the accuracy of the calculations before you do anything else. How much error can you tolerate in the result? 10^-4? 10^-16?

Unless you can quantify the error in any method, don't use it. See all those random answers below mine, that post a bunch of random uncommented source code, without clearly documenting the algorithm used and its maximum error across the input range? That's strictly bush league. If you can't calculate the exact maximum error in your sine function, you can't write a sine function.

No one uses the Taylor series alone to approximate transcendentals in software. Except for certain highly specific cases, Taylor series generally approach the target slowly across common input ranges.

The algorithms that your grandparents used to calculate transcendentals efficiently, are collectively referred to as CORDIC and were simple enough to be implemented in hardware. Here is a well documented CORDIC implementation in C. CORDIC implementations, usually, require a very small lookup table, but most implementations don't even require that a hardware multiplier be available. Most CORDIC implementations let you trade off performance for accuracy, including the one I linked.

There's been a lot of incremental improvements to the original CORDIC algorithms over the years. For example, last year some researchers in Japan published an article on an improved CORDIC with better rotation angles, which reduces the operations required.
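
To get a feel for the structure of CORDIC, here is a bare-bones, float-based sketch of the rotation mode (my own illustration, not the implementation linked above; production code would use fixed point and shifts rather than ldexp):

#include <cmath>

// Rotation-mode CORDIC for angles in roughly [-pi/2, pi/2].
void cordic_sincos(float angle, float *s, float *c)
{
    constexpr int N = 16;
    static float atan_tab[N];
    static float gain = 0.0f;
    if (gain == 0.0f) {                          // one-time table setup (not thread-safe)
        float k = 1.0f;
        for (int i = 0; i < N; ++i) {
            atan_tab[i] = std::atan(std::ldexp(1.0f, -i));        // atan(2^-i)
            k *= 1.0f / std::sqrt(1.0f + std::ldexp(1.0f, -2 * i));
        }
        gain = k;                                // ~0.607253
    }
    float x = gain, y = 0.0f, z = angle;
    for (int i = 0; i < N; ++i) {
        float d  = (z >= 0.0f) ? 1.0f : -1.0f;   // rotate toward z == 0
        float xn = x - d * std::ldexp(y, -i);    // x - d * y * 2^-i
        y        = y + d * std::ldexp(x, -i);    // y + d * x * 2^-i
        x        = xn;
        z       -= d * atan_tab[i];
    }
    *c = x;      // ~cos(angle)
    *s = y;      // ~sin(angle)
}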

If you have hardware multipliers sitting around (and you almost certainly do), or if you can't afford a lookup table like CORDIC requires, you can always use a Chebyshev polynomial to do the same thing. Chebyshev polynomials require multiplies, but this is rarely a problem on modern hardware. We like Chebyshev polynomials because they have highly predictable maximum errors for a given approximation. The maximum of the last term in a Chebyshev polynomial, across your input range, bounds the error in the result. And this error gets smaller as the number of terms gets larger. Here's one example of a Chebyshev polynomial giving a sine approximation across a huge range, ignoring the natural symmetry of the sine function and just solving the approximation problem by throwing more coefficients at it. And here's an example of estimating a sine function to within 5 ULPs. Don't know what an ULP is? You should.

We also like Chebyshev polynomials because the error in the approximation is equally distributed across the range of outputs. If you're writing audio plugins or doing digital signal processing, Chebyshev polynomials give you a cheap and predictable dithering effect "for free."

If you want to find your own Chebyshev polynomial coefficients across a specific range, many math libraries call the process of finding those coefficients "Chebyshev fit" or something like that.

Square roots, then as now, are usually calculated with some variant of the Newton-Raphson algorithm, usually with a fixed number of iterations. Usually, when someone cranks out an "amazing new" algorithm for doing square roots, it's merely Newton-Raphson in disguise.

Newton-Raphson, CORDIC, and Chebyshev polynomials let you trade off speed for accuracy, so the answer can be just as imprecise as you want.
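
As a concrete illustration of the square-root case, here is a minimal sketch (not any particular library's code) that seeds with the exponent-halving bit trick and then runs a fixed number of Newton-Raphson steps; more steps mean more accuracy and less speed:

#include <cstdint>
#include <cstring>

inline float nr_sqrt(float x, int steps = 2)
{
    if (x <= 0.0f) return 0.0f;               // ignore negatives/zero for brevity
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);             // type-pun safely
    i = (i + (127u << 23)) >> 1;               // rough sqrt: halve the biased exponent
    float y;
    std::memcpy(&y, &i, sizeof y);
    for (int k = 0; k < steps; ++k)
        y = 0.5f * (y + x / y);                // Newton-Raphson: y <- (y + x/y) / 2
    return y;
}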

Lastly, when you've finished all your fancy benchmarking and micro-optimization, make sure that your "fast" version is actually faster than the library version. Here is a typical library implementation of fsin() bounded on the domain from -pi/4 to pi/4. And it just ain't that damned slow.

There are people who have dedicated their lives to solving these problems efficiently, and they have produced some fascinating results. When you're ready to join the old school, pick up a copy of Numerical Recipes.

tl;dr: go google "sine approximation" or "cosine approximation" or "square root approximation" or "approximation theory."




Answer 4:


For the square root, there is an approach called the bit-shift method.

An IEEE-754 float stores a number as a sign bit, an exponent (a power of two) and a mantissa. Halving the exponent roughly halves the base-2 logarithm, which gives an approximate square root.

float squareRoot(float x)
{
  // Reinterpret the float's bits as an integer (type punning via a pointer
  // cast; std::memcpy would be the strictly conforming way in C++).
  unsigned int i = *(unsigned int*) &x;

  // Add the exponent bias, then halve the whole bit pattern: this roughly
  // halves the exponent, which approximates the square root.
  i  += 127 << 23;
  i >>= 1;

  return *(float*) &i;
}

That computes an approximate square root in constant time.




Answer 5:


Based on the idea of http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648 and some manual rewriting to improve the performance in a micro benchmark, I ended up with the following cosine implementation, which is used in an HPC physics simulation that is bottlenecked by repeated cos calls over a large number space. It's accurate enough and much faster than a lookup table; most notably, no division is required.

#include <cmath>  // std::floor, std::abs, M_PI (MSVC may need _USE_MATH_DEFINES)

template<typename T>
inline T cos(T x) noexcept
{
    constexpr T tp = 1./(2.*M_PI);
    x *= tp;
    x -= T(.25) + std::floor(x + T(.25));
    x *= T(16.) * (std::abs(x) - T(.5));
    #if EXTRA_PRECISION
    x += T(.225) * x * (std::abs(x) - T(1.));
    #endif
    return x;
}

The Intel compiler, at least, is also smart enough to vectorize this function when it is used in a loop.

If EXTRA_PRECISION is defined, the maximum error is about 0.00109 for the range -π to π, assuming T is double (as it usually is). Otherwise, the maximum error is about 0.056 for the same range.
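
If you want to verify those bounds in your own build, a quick harness along these lines works (fast_cos is an assumed rename of the template above, to avoid clashing with std::cos; on MSVC define _USE_MATH_DEFINES before <cmath> for M_PI):

#include <cmath>
#include <cstdio>

// Measure the maximum deviation of the approximation from std::cos.
int main()
{
    double max_err = 0.0;
    for (double x = -M_PI; x <= M_PI; x += 1e-4) {
        double err = std::fabs(fast_cos(x) - std::cos(x));  // fast_cos: assumed rename
        if (err > max_err) max_err = err;
    }
    std::printf("max error on [-pi, pi]: %g\n", max_err);
    return 0;
}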




Answer 6:


For x86, the hardware FP square root instructions are fast (sqrtss is sqrt Scalar Single-precision). Single precision is faster than double-precision, so definitely use float instead of double for code where you can afford to use less precision.

For 32bit code, you usually need compiler options to get it to do FP math with SSE instructions, rather than x87. (e.g. -mfpmath=sse)

To get C's sqrt() or sqrtf() functions to inline as just sqrtsd or sqrtss, you need to compile with -fno-math-errno. Having math functions set errno on NaN is generally considered a design mistake, but the standard requires it. Without that option, gcc inlines it but then does a compare+branch to see if the result was NaN, and if so calls the library function so it can set errno. If your program doesn't check errno after math functions, there is no danger in using -fno-math-errno.

You don't need any of the "unsafe" parts of -ffast-math to get sqrt and some other functions to inline better or at all, but -ffast-math can make a big difference (e.g. allowing the compiler to auto-vectorize in cases where that changes the result, because FP math isn't associative).

e.g. with gcc6.3 compiling float foo(float a){ return sqrtf(a); }

foo:    # with -O3 -fno-math-errno.
    sqrtss  xmm0, xmm0
    ret

foo:   # with just -O3
    pxor    xmm2, xmm2   # clang just checks for NaN, instead of comparing against zero.
    sqrtss  xmm1, xmm0
    ucomiss xmm2, xmm0
    ja      .L8          # take the slow path if 0.0 > a
    movaps  xmm0, xmm1
    ret

.L8:                     # errno-setting path
    sub     rsp, 24
    movss   DWORD PTR [rsp+12], xmm1   # store the sqrtss result because the x86-64 SysV ABI has no call-preserved xmm regs.
    call    sqrtf                      # call sqrtf just to set errno
    movss   xmm1, DWORD PTR [rsp+12]
    add     rsp, 24
    movaps  xmm0, xmm1    # extra mov because gcc reloaded into the wrong register.
    ret

gcc's code for the NaN case seems way over-complicated; it doesn't even use the sqrtf return value! Anyway, this is the kind of mess you actually get without -fno-math-errno, for every sqrtf() call site in your program. Mostly it's just code-bloat, and none of the .L8 block will ever run when taking the sqrt of a number >= 0.0, but there's still several extra instructions in the fast path.


If you know that your input to sqrt is non-zero, you can use the fast but very approximate reciprocal sqrt instruction, rsqrtps (or rsqrtss for the scalar version). One Newton-Raphson iteration brings it up to nearly the same precision as the hardware single-precision sqrt instruction, but not quite.

sqrt(x) = x * 1/sqrt(x), for x != 0, so you can calculate a sqrt with rsqrt and a multiply. These are both fast, even on P4 (was that still relevant in 2013?).

On P4, it may be worth using rsqrt + Newton iteration to replace a single sqrt, even if you don't need to divide anything by it.

See also an answer I wrote recently about handling zeroes when calculating sqrt(x) as x*rsqrt(x), with a Newton Iteration. I included some discussion of rounding error if you want to convert the FP value to an integer, and links to other relevant questions.
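
For reference, a minimal scalar sketch of the sqrt(x) = x * rsqrt(x) idea with one Newton-Raphson refinement (my own illustration; it assumes x > 0, since x == 0 gives 0 * inf = NaN):

#include <immintrin.h>

inline float fast_sqrt_rsqrt(float x)
{
    __m128 vx = _mm_set_ss(x);
    __m128 r  = _mm_rsqrt_ss(vx);                        // ~12-bit accurate 1/sqrt(x)
    __m128 half_x = _mm_mul_ss(_mm_set_ss(0.5f), vx);
    // One Newton-Raphson step on the reciprocal square root:
    // r' = r * (1.5 - 0.5 * x * r * r)
    r = _mm_mul_ss(r, _mm_sub_ss(_mm_set_ss(1.5f),
                                 _mm_mul_ss(half_x, _mm_mul_ss(r, r))));
    return _mm_cvtss_f32(_mm_mul_ss(vx, r));             // sqrt(x) = x * (1/sqrt(x))
}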


P4:

  • sqrtss: 23c latency, not pipelined
  • sqrtsd: 38c latency, not pipelined
  • fsqrt (x87): 43c latency, not pipelined
  • rsqrtss / mulss: 4c + 6c latency. Possibly one per 3c throughput, since they apparently don't need the same execution unit (mmx vs. fp).

  • SIMD packed versions are somewhat slower


Skylake:

  • sqrtss/sqrtps: 12c latency, one per 3c throughput
  • sqrtsd/sqrtpd: 15-16c latency, one per 4-6c throughput
  • fsqrt (x87): 14-21c latency, one per 4-7c throughput
  • rsqrtss / mulss: 4c + 4c latency. One per 1c throughput.
  • SIMD 128b vector versions are the same speed. 256b vector versions are a bit higher latency, almost half throughput. The rsqrtss version has full performance for 256b vectors.

With a Newton Iteration, the rsqrt version is not much if at all faster.


Numbers from Agner Fog's experimental testing. See his microarch guides to understand what makes code run fast or slow. Also see links at the x86 tag wiki.

IDK how best to calculate sin/cos. I've read that the hardware fsin / fcos (and the only slightly slower fsincos that does both at once) are not the fastest way, but IDK what is.




Answer 7:


Qt has fast implementations of sine (qFastSin) and cosine (qFastCos) that use a lookup table with interpolation. I'm using them in my code and they are faster than std::sin/cos and precise enough for what I need (error ~= 0.01%, I would guess):

https://code.woboq.org/qt5/qtbase/src/corelib/kernel/qmath.h.html#_Z8qFastSind

#define QT_SINE_TABLE_SIZE 256


inline qreal qFastSin(qreal x)
{
   int si = int(x * (0.5 * QT_SINE_TABLE_SIZE / M_PI)); // Would be more accurate with qRound, but slower.
   qreal d = x - si * (2.0 * M_PI / QT_SINE_TABLE_SIZE);
   int ci = si + QT_SINE_TABLE_SIZE / 4;
   si &= QT_SINE_TABLE_SIZE - 1;
   ci &= QT_SINE_TABLE_SIZE - 1;
   return qt_sine_table[si] + (qt_sine_table[ci] - 0.5 * qt_sine_table[si] * d) * d;
}

inline qreal qFastCos(qreal x)
{
   int ci = int(x * (0.5 * QT_SINE_TABLE_SIZE / M_PI)); // Would be more accurate with qRound, but slower.
   qreal d = x - ci * (2.0 * M_PI / QT_SINE_TABLE_SIZE);
   int si = ci + QT_SINE_TABLE_SIZE / 4;
   si &= QT_SINE_TABLE_SIZE - 1;
   ci &= QT_SINE_TABLE_SIZE - 1;
   return qt_sine_table[si] - (qt_sine_table[ci] + 0.5 * qt_sine_table[si] * d) * d;
}

The LUT and license can be found here: https://code.woboq.org/qt5/qtbase/src/corelib/kernel/qmath.cpp.html#qt_sine_table




Answer 8:


I use the following code to compute trigonometric functions in quadruple precision. The constant N determines the number of bits of precision required (for example N=26 will give single precision accuracy). Depending on desired accuracy, the precomputed storage can be small and will fit in the cache. It only requires addition and multiplication operations and is also very easy to vectorize.

The algorithm pre-computes sin and cos values for 0.5^i, i=1,...,N. Then we can combine these precomputed values to compute sin and cos for any angle, up to a resolution of 0.5^N.

template <class QuadReal_t>
QuadReal_t sin(const QuadReal_t a){
  const int N=128;
  static std::vector<QuadReal_t> theta;
  static std::vector<QuadReal_t> sinval;
  static std::vector<QuadReal_t> cosval;
  if(theta.size()==0){
    #pragma omp critical (QUAD_SIN)
    if(theta.size()==0){
      theta.resize(N);
      sinval.resize(N);
      cosval.resize(N);

      QuadReal_t t=1.0;
      for(int i=0;i<N;i++){
        theta[i]=t;
        t=t*0.5;
      }

      sinval[N-1]=theta[N-1];
      cosval[N-1]=1.0-sinval[N-1]*sinval[N-1]/2;
      for(int i=N-2;i>=0;i--){
        sinval[i]=2.0*sinval[i+1]*cosval[i+1];
        cosval[i]=sqrt(1.0-sinval[i]*sinval[i]);
      }
    }
  }

  QuadReal_t t=(a<0.0?-a:a);
  QuadReal_t sval=0.0;
  QuadReal_t cval=1.0;
  for(int i=0;i<N;i++){
    while(theta[i]<=t){
      QuadReal_t sval_=sval*cosval[i]+cval*sinval[i];
      QuadReal_t cval_=cval*cosval[i]-sval*sinval[i];
      sval=sval_;
      cval=cval_;
      t=t-theta[i];
    }
  }
  return (a<0.0?-sval:sval);
}



Answer 9:


This is an implementation of the Taylor series method given previously in akellehe's answer.

unsigned int Math::SIN_LOOP = 15;
unsigned int Math::COS_LOOP = 15;

// sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
template <class T>
T Math::sin(T x)
{
    T Sum       = 0;
    T Power     = x;
    T Sign      = 1;
    const T x2  = x * x;
    T Fact      = 1.0;
    for (unsigned int i=1; i<SIN_LOOP; i+=2)
    {
        Sum     += Sign * Power / Fact;
        Power   *= x2;
        Fact    *= (i + 1) * (i + 2);
        Sign    *= -1.0;
    }
    return Sum;
}

// cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + ...
template <class T>
T Math::cos(T x)
{
    T Sum       = 1;
    T Power     = 1;
    T Sign      = 1.0;
    const T x2  = x * x;
    T Fact      = 1.0;
    for (unsigned int i=2; i<COS_LOOP; i+=2)
    {
        Power   *= x2;
        Fact    *= i * (i - 1);
        Sign    *= -1.0;
        Sum     += Sign * Power / Fact;
    }
    return Sum;
}



Answer 10:


Sharing my code: it's a 6th-degree polynomial, nothing special, but rearranged to avoid pow calls. On a Core i7 this is 2.3 times slower than the standard implementation, although a bit faster for the [0..2*PI] range. For an old processor this could be an alternative to the standard sin/cos.

/*
    On [-1000..+1000] range with 0.001 step average error is: +/- 0.000011, max error: +/- 0.000060
    On [-100..+100] range with 0.001 step average error is:   +/- 0.000009, max error: +/- 0.000034
    On [-10..+10] range with 0.001 step average error is:     +/- 0.000009, max error: +/- 0.000030
    Error distribution ensures there's no discontinuity.
*/

const double PI          = 3.141592653589793;
const double HALF_PI     = 1.570796326794897;
const double DOUBLE_PI   = 6.283185307179586;
const double SIN_CURVE_A = 0.0415896;
const double SIN_CURVE_B = 0.00129810625032;

double cos1(double x) {
    if (x < 0) {
        int q = -x / DOUBLE_PI;
        q += 1;
        double y = q * DOUBLE_PI;
        x = -(x - y);
    }
    if (x >= DOUBLE_PI) {
        int q = x / DOUBLE_PI;
        double y = q * DOUBLE_PI;
        x = x - y;
    }
    int s = 1;
    if (x >= PI) {
        s = -1;
        x -= PI;
    }
    if (x > HALF_PI) {
        x = PI - x;
        s = -s;
    }
    double z = x * x;
    double r = z * (z * (SIN_CURVE_A - SIN_CURVE_B * z) - 0.5) + 1.0;
    if (r > 1.0) r = r - 2.0;
    if (s > 0) return r;
    else return -r;
}

double sin1(double x) {
    return cos1(x - HALF_PI);
}



Answer 11:


Over 100,000,000 tests, milianw's answer is 2 times slower than the std::cos implementation. However, you can manage to make it run faster by doing the following:

-> use float

-> don't use floor but static_cast

-> don't use abs but a ternary conditional

-> use a #define constant for the division

-> use a macro to avoid a function call

// 1 / (2 * PI)
#define FPII 0.159154943091895
//PI / 2
#define PI2 1.570796326794896619

#define _cos(x)         x *= FPII;\
                        x -= .25f + static_cast<int>(x + .25f) - 1;\
                        x *= 16.f * ((x >= 0 ? x : -x) - .5f);
#define _sin(x)         x -= PI2; _cos(x);

Over 100,000,000 calls to std::cos and _cos(x), std::cos runs in ~14 s vs ~3 s for _cos(x) (a little more for _sin(x)).
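
Note that these macros overwrite their argument in place, so pass them a throwaway copy. A minimal usage sketch, assuming the definitions above:

float angle = 1.234f;

float c = angle;    // the macros clobber their argument, so copy first
_cos(c);            // c now holds the cosine approximation of angle

float s = angle;
_sin(s);            // s now holds the sine approximation of angle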




Answer 12:


So let me rephrase that: this idea comes from approximating the cosine and sine functions on the interval [-pi/4, +pi/4] with a bounded error using the Remez algorithm. Then, using the range-reduced float remainder and a LUT for the cosine and sine of the integer quotient, the approximation can be moved to any angular argument.

It's just unique, and I thought it could be expanded on to make a more efficient algorithm in terms of a bounded error.

void sincos_fast(float x, float *pS, float *pC){
     float cosOff4LUT[] = { 0x1.000000p+00,  0x1.6A09E6p-01,  0x0.000000p+00, -0x1.6A09E6p-01, -0x1.000000p+00, -0x1.6A09E6p-01,  0x0.000000p+00,  0x1.6A09E6p-01 };

    int     m, ms, mc;
    float   xI, xR, xR2;
    float   c, s, cy, sy;

    // Cody & Waite's range reduction Algorithm, [-pi/4, pi/4]
    xI  = floorf(x * 0x1.45F306p+00 + 0.5);
    xR  = (x - xI * 0x1.920000p-01) - xI*0x1.FB5444p-13;
    m   = (int) xI;
    xR2 = xR*xR;

    // Find cosine & sine index for angle offsets indices
    mc = (  m  ) & 0x7;     // two's complement permits upper modulus for negative numbers =P
    ms = (m + 6) & 0x7;     // two's complement permits upper modulus for negative numbers =P, note phase correction for sine.

    // Find cosine & sine
    cy = cosOff4LUT[mc];     // Load angle offset neighborhood cosine value 
    sy = cosOff4LUT[ms];     // Load angle offset neighborhood sine value 

    c = 0xf.ff79fp-4 + xR2 * (-0x7.e58e9p-4);               // TOL = 1.2786e-4
    // c = 0xf.ffffdp-4 + xR2 * (-0x7.ffebep-4 + xR2 * 0xa.956a9p-8);  // TOL = 1.7882e-7

    s = xR * (0xf.ffbf7p-4 + xR2 * (-0x2.a41d0cp-4));   // TOL = 4.835251e-6
    // s = xR * (0xf.fffffp-4 + xR2 * (-0x2.aaa65cp-4 + xR2 * 0x2.1ea25p-8));  // TOL = 1.1841e-8

     *pC = c*cy - s*sy;     
    *pS = c*sy + s*cy;
}

float sqrt_fast(float x){
    union {float f; int i; } X, Y;
    float ScOff;
    uint8_t e;

    X.f = x;
    e = (X.i >> 23);           // f.SFPbits.e;

    if(x <= 0) return(0.0f);

    ScOff = ((e & 1) != 0) ? 1.0f : 0x1.6a09e6p0;  // NOTE: If exp=EVEN, b/c (exp-127) a (EVEN - ODD) := ODD; but a (ODD - ODD) := EVEN!!

    e = ((e + 127) >> 1);                            // NOTE: If exp=ODD,  b/c (exp-127) then flr((exp-127)/2)
    X.i = (X.i & ((1uL << 23) - 1)) | (0x7F << 23);  // Mask mantissa, force exponent to zero.
    Y.i = (((uint32_t) e) << 23);

    // Error grows with square root of the exponent. Unfortunately no work around like inverse square root... :(
    // Y.f *= ScOff * (0x9.5f61ap-4 + X.f*(0x6.a09e68p-4));        // Error = +-1.78e-2 * 2^(flr(log2(x)/2))
    // Y.f *= ScOff * (0x7.2181d8p-4 + X.f*(0xa.05406p-4 + X.f*(-0x1.23a14cp-4)));      // Error = +-7.64e-5 * 2^(flr(log2(x)/2))
    // Y.f *= ScOff * (0x5.f10e7p-4 + X.f*(0xc.8f2p-4 +X.f*(-0x2.e41a4cp-4 + X.f*(0x6.441e6p-8))));     // Error =  8.21e-5 * 2^(flr(log2(x)/2))
    // Y.f *= ScOff * (0x5.32eb88p-4 + X.f*(0xe.abbf5p-4 + X.f*(-0x5.18ee2p-4 + X.f*(0x1.655efp-4 + X.f*(-0x2.b11518p-8)))));   // Error = +-9.92e-6 * 2^(flr(log2(x)/2))
    // Y.f *= ScOff * (0x4.adde5p-4 + X.f*(0x1.08448cp0 + X.f*(-0x7.ae1248p-4 + X.f*(0x3.2cf7a8p-4 + X.f*(-0xc.5c1e2p-8 + X.f*(0x1.4b6dp-8))))));   // Error = +-1.38e-6 * 2^(flr(log2(x)/2))
    // Y.f *= ScOff * (0x4.4a17fp-4 + X.f*(0x1.22d44p0 + X.f*(-0xa.972e8p-4 + X.f*(0x5.dd53fp-4 + X.f*(-0x2.273c08p-4 + X.f*(0x7.466cb8p-8 + X.f*(-0xa.ac00ep-12)))))));    // Error = +-2.9e-7 * 2^(flr(log2(x)/2))
    Y.f *= ScOff * (0x3.fbb3e8p-4 + X.f*(0x1.3b2a3cp0 + X.f*(-0xd.cbb39p-4 + X.f*(0x9.9444ep-4 + X.f*(-0x4.b5ea38p-4 + X.f*(0x1.802f9ep-4 + X.f*(-0x4.6f0adp-8 + X.f*(0x5.c24a28p-12 ))))))));   // Error = +-2.7e-6 * 2^(flr(log2(x)/2))

    return(Y.f);
}



Answer 13:


Just use the FPU with inline x86 assembly for Wintel apps. The CPU's direct sqrt instruction is reportedly still beating any other algorithm in speed. My custom x86 math library code is for standard MSVC++ 2005 and later. You need separate float/double versions if you want more precision, which I covered. Sometimes the compiler's "__inline" strategy goes bad, so to be safe, you can remove it. With experience, you can switch to macros to avoid a function call entirely.

extern __inline float  __fastcall fs_sin(float x);
extern __inline double __fastcall fs_Sin(double x);
extern __inline float  __fastcall fs_cos(float x);
extern __inline double __fastcall fs_Cos(double x);
extern __inline float  __fastcall fs_atan(float x);
extern __inline double __fastcall fs_Atan(double x);
extern __inline float  __fastcall fs_sqrt(float x);
extern __inline double __fastcall fs_Sqrt(double x);
extern __inline float  __fastcall fs_log(float x);
extern __inline double __fastcall fs_Log(double x);

extern __inline float __fastcall fs_sqrt(float x) { __asm {
FLD x  ;// Load/Push input value
FSQRT
}}

extern __inline double __fastcall fs_Sqrt(double x) { __asm {
FLD x  ;// Load/Push input value
FSQRT
}}

extern __inline float __fastcall fs_sin(float x) { __asm {
FLD x  ;// Load/Push input value
FSIN
}}

extern __inline double __fastcall fs_Sin(double x) { __asm {
FLD x  ;// Load/Push input value
FSIN
}}    

extern __inline float __fastcall fs_cos(float x) { __asm {
FLD x  ;// Load/Push input value
FCOS
}}

extern __inline double __fastcall fs_Cos(double x) { __asm {
FLD x  ;// Load/Push input value
FCOS
}}

extern __inline float __fastcall fs_tan(float x) { __asm {
FLD x  ;// Load/Push input value
FPTAN
FSTP ST(0) ;// Pop the 1.0 pushed by FPTAN, leaving tan(x) in ST(0)
}}

extern __inline double __fastcall fs_Tan(double x) { __asm {
FLD x  ;// Load/Push input value
FPTAN
FSTP ST(0) ;// Pop the 1.0 pushed by FPTAN, leaving tan(x) in ST(0)
}}

extern __inline float __fastcall fs_log(float x) { __asm {
FLDLN2
FLD x
FYL2X
FSTP ST(1) ;// Pop1, Pop2 occurs on return
}}

extern __inline double __fastcall fs_Log(double x) { __asm {
FLDLN2
FLD x
FYL2X
FSTP ST(1) ;// Pop1, Pop2 occurs on return
}}


Source: https://stackoverflow.com/questions/18662261/fastest-implementation-of-sine-cosine-and-square-root-in-c-doesnt-need-to-b
