I have been googling this question for the past hour, but all I find are pointers to the Taylor series or sample code that is either too slow or does not compile at all.
This is a sine implementation that should be quite fast. It works like this:
it has an arithmetical implementation of square-rooting complex numbers
from analytical math with complex numbers you know that the angle of a complex number is halved when you take its square root
You can take a complex number whose angle you already know (e.g. i, which has angle 90 degrees, or PI / 2 radians)
Then by repeatedly square-rooting it you can get complex numbers of the form cos(90 / 2^n) + i sin(90 / 2^n)
from analytical math with complex numbers you know that when two complex numbers are multiplied, their angles add up
you can write the number k (the one you get as an argument of sin or cos) as a sum of angles of the form 90 / 2^n, and then get the resulting values by multiplying the complex numbers you precomputed
the result will be of the form cos k + i sin k
#include <utility> // for std::pair and std::make_pair used below
using std::make_pair;
using std::pair;

#define PI 3.14159265
#define complex pair<float, float>
/* square root for floats: halves the exponent of the input to get a first guess, then refines it with Newton iterations */
float sqrt(float a) {
float b = a;
int *x = (int*) (&b); // get an integer pointer to float b, so its bits can be changed directly (note: this type punning is technically undefined behaviour; memcpy would be the portable way)
int c = ((*x >> 23) & 255) - 127; // extract the exponent of 2 (floats are stored like scientific notation: 1.111010101... * 2^n)
if(c < 0) c = -((-c) >> 1); // ---
// |--> halve the exponent
else c >>= 1; // ---
*x &= ~(255 << 23); // clear the bits reserved for the exponent
*x |= (c + 127) << 23; // put the halved (re-biased) exponent back in place
for(int i = 0; i < 5; i++) b = (b + a / b) / 2; // refine with Newton (Babylonian) iterations; even 2 or 3 would probably be enough
return b;
}
/* square root for complex numbers (derived on paper), you'll need it later: for z = c + d*i, sqrt(z) = sqrt((|z| + c) / 2) + i * sign(d) * sqrt((|z| - c) / 2) */
complex croot(complex x) {
float c = x.first, d = x.second;
return make_pair(sqrt((c + sqrt(c * c + d * d)) / 2), sqrt((-c + sqrt(c * c + d * d)) / 2) * (d < 0 ? -1 : 1));
}
/* this is for multiplying complex numbers, you'll also need it later */
complex mul(complex x, complex y) {
float a = x.first, b = x.second, c = y.first, d = y.second;
return make_pair(a * c - b * d, a * d + b * c);
}
/* the following calculates both sine and cosine at once: init() precomputes the roots and their angles, cosin() combines them */
complex roots[24];
float angles[24];
void init() {
complex c = make_pair(-1, 0); // first number is going to be -1
float alpha = PI; // angle of -1 is PI
for(int i = 0; i < 24; i++) {
roots[i] = c; // save current c
angles[i] = alpha; // save current angle
c = croot(c); // root c
alpha *= 0.5; // halve alpha
}
}
complex cosin(float k) {
complex r = make_pair(1, 0); // at start 1
for(int i = 0; i < 24; i++) {
if(k >= angles[i]) { // if the remaining k is at least this angle
k -= angles[i]; // subtract that angle from k
r = mul(r, roots[i]); // and multiply the result by the corresponding root
}
}
return r; // here you'll have a complex number equal to cos k + i sin k.
}
float sin(float k) {
return cosin(k).second;
}
float cos(float k) {
return cosin(k).first;
}
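To round this answer out, here is a minimal driver (my addition, not part of the original answer): call init() once, then cosin() gives both values at once. The precomputed angles add up to roughly one full turn, so k is expected in [0, 2 * PI).

#include <cstdio>

int main() {
    init();                     // fill roots[] and angles[] once
    float k = 1.0f;             // angle in radians, expected in [0, 2 * PI)
    complex cs = cosin(k);      // cs.first ~ cos(k), cs.second ~ sin(k)
    printf("cos: %f  sin: %f\n", cs.first, cs.second);  // roughly 0.5403 and 0.8415
    return 0;
}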
Now if you still find it slow, you can reduce the number of iterations in the cosin function (note that the precision will be reduced accordingly).
Over 100,000,000 tests, milianw's answer is two times slower than the std::cos implementation. However, you can make it run faster with the following steps:
->use float
->don't use floor, use static_cast instead
->don't use abs, use a ternary conditional instead
->use a #define constant instead of the division
->use macros to avoid function calls
// 1 / (2 * PI)
#define FPII 0.159154943091895
//PI / 2
#define PI2 1.570796326794896619
#define _cos(x) x *= FPII;\
x -= .25f + static_cast<int>(x + .25f) - 1;\
x *= 16.f * ((x >= 0 ? x : -x) - .5f);
#define _sin(x) x -= PI2; _cos(x);
Over 100,000,000 calls to std::cos and _cos(x), std::cos runs in ~14 s vs ~3 s for _cos(x) (a little bit more for _sin(x)).
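Note that these are statement macros that overwrite their argument in place, so a call site has to work on a copy. A minimal sketch of that calling pattern (my addition, not part of the benchmark above):

#include <cstdio>

int main() {
    float t = 2.0f;      // input angle in radians
    float c = t, s = t;  // the macros rewrite their argument, so work on copies
    _cos(c);             // c now holds the cosine approximation
    _sin(s);             // s now holds the sine approximation
    printf("~cos: %f  ~sin: %f\n", c, s);
    return 0;
}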
Just use the FPU with inline x86 for Wintel apps. The direct CPU sqrt instruction is reportedly still beating any other algorithm in speed. My custom x86 math library code is for standard MSVC++ 2005 and forward. There are separate float/double versions in case you want more precision, and I've covered both. Sometimes the compiler's "__inline" strategy goes bad, so to be safe, you can remove it. With experience, you can switch to macros to avoid a function call each time altogether.
extern __inline float __fastcall fs_sin(float x);
extern __inline double __fastcall fs_Sin(double x);
extern __inline float __fastcall fs_cos(float x);
extern __inline double __fastcall fs_Cos(double x);
extern __inline float __fastcall fs_atan(float x);
extern __inline double __fastcall fs_Atan(double x);
extern __inline float __fastcall fs_sqrt(float x);
extern __inline double __fastcall fs_Sqrt(double x);
extern __inline float __fastcall fs_log(float x);
extern __inline double __fastcall fs_Log(double x);
extern __inline float __fastcall fs_sqrt(float x) { __asm {
FLD x ;// Load/Push input value
FSQRT
}}
extern __inline double __fastcall fs_Sqrt(double x) { __asm {
FLD x ;// Load/Push input value
FSQRT
}}
extern __inline float __fastcall fs_sin(float x) { __asm {
FLD x ;// Load/Push input value
FSIN
}}
extern __inline double __fastcall fs_Sin(double x) { __asm {
FLD x ;// Load/Push input value
FSIN
}}
extern __inline float __fastcall fs_cos(float x) { __asm {
FLD x ;// Load/Push input value
FCOS
}}
extern __inline double __fastcall fs_Cos(double x) { __asm {
FLD x ;// Load/Push input value
FCOS
}}
extern __inline float __fastcall fs_tan(float x) { __asm {
FLD x ;// Load/Push input value
FPTAN
FSTP ST(0) ;// FPTAN pushes 1.0 on top of the result; pop it so tan(x) is left in ST(0)
}}
extern __inline double __fastcall fs_Tan(double x) { __asm {
FLD x ;// Load/Push input value
FPTAN
FSTP ST(0) ;// FPTAN pushes 1.0 on top of the result; pop it so tan(x) is left in ST(0)
}}
extern __inline float __fastcall fs_log(float x) { __asm {
FLDLN2
FLD x
FYL2X
FSTP ST(1) ;// Pop1, Pop2 occurs on return
}}
extern __inline double __fastcall fs_Log(double x) { __asm {
FLDLN2
FLD x
FYL2X
FSTP ST(1) ;// Pop1, Pop2 occurs on return
}}
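A minimal call-site sketch (my addition; this only builds with 32-bit MSVC, since x64 MSVC doesn't support inline __asm, and each function returns its result on the FPU stack in ST(0)):

#include <cstdio>

int main() {
    float  s = fs_sin(0.5f);   // ~0.4794
    double c = fs_Cos(0.5);    // ~0.8776
    float  r = fs_sqrt(2.0f);  // ~1.4142
    printf("%f %f %f\n", s, c, r);
    return 0;
}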
The fastest way is to pre-compute values and use a table like in this example:
Create sine lookup table in C++
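For illustration, here is a minimal table-based sketch (my own code, not from the linked question; the table size is arbitrary and the lookup uses the nearest entry without interpolation):

#include <cmath>
#include <cstdio>

constexpr int   TABLE_SIZE = 1024;            // arbitrary; a bigger table gives more accuracy
constexpr float TWO_PI     = 6.28318530718f;
static float sine_table[TABLE_SIZE];

void build_table() {
    for (int i = 0; i < TABLE_SIZE; ++i)
        sine_table[i] = std::sin(i * TWO_PI / TABLE_SIZE);
}

// Nearest-entry lookup for k >= 0 (radians); max error is on the order of PI / TABLE_SIZE.
float lut_sin(float k) {
    int i = static_cast<int>(k * (TABLE_SIZE / TWO_PI) + 0.5f);
    return sine_table[i & (TABLE_SIZE - 1)];  // wrap; works because TABLE_SIZE is a power of two
}

int main() {
    build_table();
    std::printf("%f vs %f\n", lut_sin(1.0f), std::sin(1.0f));
}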
BUT if you insist upon computing at runtime you can use the Taylor series expansion of sine or cosine...
For more on the Taylor series... http://en.wikipedia.org/wiki/Taylor_series
One of the keys to getting this to work well is pre-computing the factorials and truncating at a sensible number of terms. The factorials grow in the denominator very quickly, so you don't need to carry more than a few terms.
Also, don't recompute x^n from scratch for each term: e.g. multiply x^3 by x^2 to get x^5, then by x^2 again to get x^7, and so on.
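As a sketch of that advice (my own code, with an arbitrary truncation at the x^9 term): precompute the reciprocal factorials once and build the odd powers of x incrementally:

#include <cstdio>

// sin(x) ~= x - x^3/3! + x^5/5! - x^7/7! + x^9/9! (truncated Taylor series).
// Only reasonable for |x| up to about PI; reduce the argument first for larger inputs.
float taylor_sin(float x) {
    static const float inv_fact[] = {
        1.0f / 6.0f,       // 1/3!
        1.0f / 120.0f,     // 1/5!
        1.0f / 5040.0f,    // 1/7!
        1.0f / 362880.0f   // 1/9!
    };
    float x2 = x * x;
    float term = x;        // current odd power of x: x^1, then x^3, x^5, ...
    float sum = x;
    float sign = -1.0f;
    for (int i = 0; i < 4; ++i) {
        term *= x2;        // next odd power, reusing the previous one
        sum += sign * term * inv_fact[i];
        sign = -sign;
    }
    return sum;
}

int main() {
    printf("%f\n", taylor_sin(1.0f));  // ~0.841471
}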
Qt has fast implementations of sine (qFastSin) and cosine (qFastCos) that use a lookup table with interpolation. I'm using them in my code and they are faster than std::sin/cos and precise enough for what I need:
https://code.woboq.org/qt5/qtbase/src/corelib/kernel/qmath.h.html#_Z8qFastSind
#define QT_SINE_TABLE_SIZE 256
inline qreal qFastSin(qreal x)
{
int si = int(x * (0.5 * QT_SINE_TABLE_SIZE / M_PI)); // Would be more accurate with qRound, but slower.
qreal d = x - si * (2.0 * M_PI / QT_SINE_TABLE_SIZE);
int ci = si + QT_SINE_TABLE_SIZE / 4;
si &= QT_SINE_TABLE_SIZE - 1;
ci &= QT_SINE_TABLE_SIZE - 1;
return qt_sine_table[si] + (qt_sine_table[ci] - 0.5 * qt_sine_table[si] * d) * d;
}
inline qreal qFastCos(qreal x)
{
int ci = int(x * (0.5 * QT_SINE_TABLE_SIZE / M_PI)); // Would be more accurate with qRound, but slower.
qreal d = x - ci * (2.0 * M_PI / QT_SINE_TABLE_SIZE);
int si = ci + QT_SINE_TABLE_SIZE / 4;
si &= QT_SINE_TABLE_SIZE - 1;
ci &= QT_SINE_TABLE_SIZE - 1;
return qt_sine_table[si] - (qt_sine_table[ci] + 0.5 * qt_sine_table[si] * d) * d;
}
The LUT and license can be found here: https://code.woboq.org/qt5/qtbase/src/corelib/kernel/qmath.cpp.html#qt_sine_table
This pair of functions takes radian inputs. The LUT covers the entire 2π input range. The functions interpolate from the nearest table entry using the difference d, taking the cosine (read from the same table a quarter turn ahead) as the derivative, with a further sine term as the second-order correction.
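Usage is a drop-in call (a small sketch assuming Qt Core is available; qFastSin/qFastCos are declared in <QtMath>):

#include <QtMath>   // declares qFastSin / qFastCos (Qt Core)
#include <cmath>
#include <cstdio>

int main() {
    qreal x = 1.0;
    std::printf("qFastSin: %f (std::sin: %f)\n", qFastSin(x), std::sin(x));
    std::printf("qFastCos: %f (std::cos: %f)\n", qFastCos(x), std::cos(x));
    return 0;
}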
For x86, the hardware FP square root instructions are fast (sqrtss is sqrt Scalar Single-precision). Single precision is faster than double precision, so definitely use float instead of double for code where you can afford to use less precision.
For 32-bit code, you usually need compiler options to get it to do FP math with SSE instructions rather than x87 (e.g. -mfpmath=sse).
To get C's sqrt() or sqrtf() functions to inline as just sqrtsd or sqrtss, you need to compile with -fno-math-errno. Having math functions set errno on NaN is generally considered a design mistake, but the standard requires it. Without that option, gcc inlines it but then does a compare+branch to see if the result was NaN, and if so calls the library function so it can set errno. If your program doesn't check errno after math functions, there is no danger in using -fno-math-errno.
You don't need any of the "unsafe" parts of -ffast-math to get sqrt and some other functions to inline better or at all, but -ffast-math can make a big difference (e.g. allowing the compiler to auto-vectorize in cases where that changes the result, because FP math isn't associative).
e.g. with gcc6.3 compiling float foo(float a){ return sqrtf(a); }
foo: # with -O3 -fno-math-errno.
sqrtss xmm0, xmm0
ret
foo: # with just -O3
pxor xmm2, xmm2 # clang just checks for NaN, instead of comparing against zero.
sqrtss xmm1, xmm0
ucomiss xmm2, xmm0
ja .L8 # take the slow path if 0.0 > a
movaps xmm0, xmm1
ret
.L8: # errno-setting path
sub rsp, 24
movss DWORD PTR [rsp+12], xmm1 # store the sqrtss result because the x86-64 SysV ABI has no call-preserved xmm regs.
call sqrtf # call sqrtf just to set errno
movss xmm1, DWORD PTR [rsp+12]
add rsp, 24
movaps xmm0, xmm1 # extra mov because gcc reloaded into the wrong register.
ret
gcc's code for the NaN case seems way over-complicated; it doesn't even use the sqrtf return value! Anyway, this is the kind of mess you actually get without -fno-math-errno, for every sqrtf() call site in your program. Mostly it's just code-bloat, and none of the .L8 block will ever run when taking the sqrt of a number >= 0.0, but there are still several extra instructions in the fast path.
If you know that your input to sqrt is non-zero, you can use the fast but very approximate reciprocal sqrt instruction, rsqrtps (or rsqrtss for the scalar version). One Newton-Raphson iteration brings it up to nearly the same precision as the hardware single-precision sqrt instruction, but not quite.
sqrt(x) = x * 1/sqrt(x) for x != 0, so you can calculate a sqrt with rsqrt and a multiply. These are both fast, even on P4 (was that still relevant in 2013?).
On P4, it may be worth using rsqrt + a Newton iteration to replace a single sqrt, even if you don't need to divide anything by it.
See also an answer I wrote recently about handling zeroes when calculating sqrt(x) as x*rsqrt(x), with a Newton Iteration. I included some discussion of rounding error if you want to convert the FP value to an integer, and links to other relevant questions.
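For example, a scalar sketch with SSE intrinsics (my addition, not the code from the linked answer; it assumes x > 0 so the rsqrt result is finite):

#include <xmmintrin.h>   // SSE: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32
#include <cstdio>

// Approximate sqrt(x) as x * rsqrt(x), refined with one Newton-Raphson step.
// rsqrtss alone gives roughly 12 bits of precision; one iteration brings it close to
// (but not exactly) the correctly rounded single-precision result.
float fast_sqrt(float x) {
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  // y ~= 1/sqrt(x)
    y = y * (1.5f - 0.5f * x * y * y);                     // one Newton-Raphson refinement
    return x * y;                                          // sqrt(x) = x * 1/sqrt(x)
}

int main() {
    std::printf("%f vs 1.41421356\n", fast_sqrt(2.0f));
    return 0;
}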
P4:
sqrtss: 23c latency, not pipelined
sqrtsd: 38c latency, not pipelined
fsqrt (x87): 43c latency, not pipelined
rsqrtss / mulss: 4c + 6c latency. Possibly one per 3c throughput, since they apparently don't need the same execution unit (mmx vs. fp).
SIMD packed versions are somewhat slower
Skylake:
sqrtss / sqrtps: 12c latency, one per 3c throughput
sqrtsd / sqrtpd: 15-16c latency, one per 4-6c throughput
fsqrt (x87): 14-21c latency, one per 4-7c throughput
rsqrtss / mulss: 4c + 4c latency, one per 1c throughput
The rsqrt version has full performance even for 256b vectors. With a Newton iteration, the rsqrt version is not much, if at all, faster.
Numbers from Agner Fog's experimental testing. See his microarch guides to understand what makes code run fast or slow. Also see links at the x86 tag wiki.
IDK how best to calculate sin/cos. I've read that the hardware fsin / fcos (and the only slightly slower fsincos that does both at once) are not the fastest way, but IDK what is.