C: Improving performance of function with heavy sin() usage

梦想的初衷 submitted on 2019-12-09 16:26:01

Question


I have a C function that computes the values of 4 sines based on time elapsed. Using gprof, I found that this function uses 100% (100.07% to be exact, lol) of the CPU time.

void
update_sines(void)
{
    clock_gettime(CLOCK_MONOTONIC, &spec);
    s = spec.tv_sec;
    ms = spec.tv_nsec * 0.0000001;
    etime = concatenate((long)s, ms);

    int k;
    for (k = 0; k < 799; ++k)
    {
        double A1 = 145 * sin((RAND1 * k + etime) * 0.00333) + RAND5;           // Amplitude
        double A2 = 100 * sin((RAND2 * k + etime) * 0.00333) + RAND4;           // Amplitude
        double A3 = 168 * sin((RAND3 * k + etime) * 0.00333) + RAND3;           // Amplitude
        double A4 = 136 * sin((RAND4 * k + etime) * 0.00333) + RAND2;           // Amplitude

        double B1 = 3 + RAND1 + (sin((RAND5 * k) * etime) * 0.00216);           // Period
        double B2 = 3 + RAND2 + (sin((RAND4 * k) * etime) * 0.002);         // Period
        double B3 = 3 + RAND3 + (sin((RAND3 * k) * etime) * 0.00245);           // Period
        double B4 = 3 + RAND4 + (sin((RAND2 * k) * etime) * 0.002);         // Period

        double x = k;                                   // Current x

        double C1 = 0.6 * etime;                            // X axis move
        double C2 = 0.9 * etime;                            // X axis move
        double C3 = 1.2 * etime;                            // X axis move
        double C4 = 0.8 * etime + 200;                          // X axis move

        double D1 = RAND1 + sin(RAND1 * x * 0.00166) * 4;               // Y axis move
        double D2 = RAND2 + sin(RAND2 * x * 0.002) * 4;                 // Y axis move
        double D3 = RAND3 + cos(RAND3 * x * 0.0025) * 4;                // Y axis move
        double D4 = RAND4 + sin(RAND4 * x * 0.002) * 4;                 // Y axis move

        sine1[k] = A1 * sin((B1 * x + C1) * 0.0025) + D1;
        sine2[k] = A2 * sin((B2 * x + C2) * 0.00333) + D2 + 100;
        sine3[k] = A3 * cos((B3 * x + C3) * 0.002) + D3 + 50;
        sine4[k] = A4 * sin((B4 * x + C4) * 0.00333) + D4 + 100;
    }

}

And this is the output from gprof:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
100.07      0.04     0.04  

I'm currently getting a frame rate of roughly 30-31 fps using this. I figure there has to be a more efficient way to do this.

As you may have noticed, I already changed all the divisions to multiplications, but that had very little effect on performance.

How could I increase the performance of this math heavy function?


Answer 1:


Besides all the other advice given in other answers, here is a pure algorithmic optimization.

In most cases, you're computing something of the form sin(k * a + b), where a and b are constants and k is the loop variable. If you also compute cos(k * a + b), you can use a 2D rotation matrix to form a recurrence relation (in matrix form):

|cos(k*a + b)| = |cos(a)  -sin(a)| * |cos((k-1)*a + b)|
|sin(k*a + b)|   |sin(a)   cos(a)|   |sin((k-1)*a + b)|

In other words, you can calculate the value for the current iteration in terms of the value from the previous iteration. Thus, you only need to do the full trig calculation for k == 0; the rest can be calculated via this recurrence (once you have calculated cos(a) and sin(a), which are constants). So you eliminate 75% of the trig function calls (it's not clear the same trick can be pulled for the final set of trig calls, because their argument is not linear in k).




Answer 2:


If you don't need all that precision, create a lookup table for the sin() values you need. If 1 degree of resolution is enough, use double sin_lookup[360], and so on; use float sin_lookup[360] if float precision is sufficient.

Also, as Keith noted in the comments: "You might also consider using linear interpolation between lookup values, which should give you substantially better accuracy (a reasonably continuous function rather than a step function) at a fairly small cost in performance"

EDIT: also consider changing the hardcoded A1, A2, A3, A4 pattern to arrays of size 4 and looping from 0 to 3. This should allow vectorization on many platforms and gives you parallelism without needing to manage threads.

EDIT2: some code and results

(Coded in C++ just to make comparisons easy between precisions, calcs are the same in C)

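#include <cmath>
#include <cstddef>
#include <vector>

// Sine/cosine via a lookup table with linear interpolation.
class simple_trig
{
public:
        // prec is the number of table entries covering one period [0, 2*PI).
        explicit simple_trig(size_t prec) : precision(prec)
        {
                static const double PI=3.141592653589793;
                const double dprec=(double)prec;
                const double quotient=(2.0*PI)/dprec;
                rev_quotient=dprec/(2.0*PI);
                // resize, not reserve: the elements must exist before they are indexed
                values.resize(prec);

                for (size_t i=0; i < precision; ++i)
                {
                        values[i]=std::sin(quotient*(double)i);
                }
        }

        double sin(double x) const
        {
                return lookup(x*rev_quotient);
        }

        double cos(double x) const
        {
                // cos(x) == sin(x + PI/2), i.e. a quarter of the table further on
                return lookup(x*rev_quotient+(double)precision/4.0);
        }

private:
        // Interpolate linearly between the two table entries around cvt,
        // where cvt is a position measured in table entries.
        double lookup(double cvt) const
        {
                const double base=std::floor(cvt);      // floor so negative x also works
                const double delta=cvt-base;
                const double dprec=(double)precision;
                double wrapped=std::fmod(base,dprec);   // wrap into one period
                if (wrapped < 0.0) wrapped+=dprec;
                const size_t lookup1=(size_t)wrapped;
                const size_t lookup2=(lookup1+1)%precision;
                return values[lookup1]*(1.0-delta)+values[lookup2]*delta;
        }

        const size_t precision;
        double rev_quotient;
        std::vector<double> values;
};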

Example results, where Low is a table of 100 entries, Med is 1,000 and High is 10,000:

X=0 Sin=0 Sin Low=0 Sin Med=0 Sin High=0
X=0 Cos=1 Cos Low=1 Cos Med=1 Cos High=1
X=0.5 Sin=0.479426 Sin Low=0.479389 Sin Med=0.479423 Sin High=0.479426
X=0.5 Cos=0.877583 Cos Low=0.877512 Cos Med=0.877578 Cos High=0.877583
X=1.33333 Sin=0.971938 Sin Low=0.971607 Sin Med=0.971935 Sin High=0.971938
X=1.33333 Cos=0.235238 Cos Low=0.235162 Cos Med=0.235237 Cos High=0.235238
X=2.25 Sin=0.778073 Sin Low=0.777834 Sin Med=0.778072 Sin High=0.778073
X=2.25 Cos=-0.628174 Cos Low=-0.627986 Cos Med=-0.628173 Cos High=-0.628174
X=3.2 Sin=-0.0583741 Sin Low=-0.0583689 Sin Med=-0.0583739 Sin High=-0.0583741
X=3.2 Cos=-0.998295 Cos Low=-0.998166 Cos Med=-0.998291 Cos High=-0.998295
X=4.16667 Sin=-0.854753 Sin Low=-0.854387 Sin Med=-0.854751 Sin High=-0.854753
X=4.16667 Cos=-0.519036 Cos Low=-0.518818 Cos Med=-0.519034 Cos High=-0.519036
X=5.14286 Sin=-0.90877 Sin Low=-0.908542 Sin Med=-0.908766 Sin High=-0.90877
X=5.14286 Cos=0.417296 Cos Low=0.417195 Cos Med=0.417294 Cos High=0.417296
X=6.125 Sin=-0.157526 Sin Low=-0.157449 Sin Med=-0.157526 Sin High=-0.157526
X=6.125 Cos=0.987515 Cos Low=0.987028 Cos Med=0.987512 Cos High=0.987515
X=7.11111 Sin=0.73653 Sin Low=0.736316 Sin Med=0.736527 Sin High=0.73653
X=7.11111 Cos=0.676405 Cos Low=0.676213 Cos Med=0.676403 Cos High=0.676405
X=8.1 Sin=0.96989 Sin Low=0.969741 Sin Med=0.969887 Sin High=0.96989
X=8.1 Cos=-0.243544 Cos Low=-0.24351 Cos Med=-0.243544 Cos High=-0.243544
X=9.09091 Sin=0.327701 Sin Low=0.327558 Sin Med=0.3277 Sin High=0.327701
X=9.09091 Cos=-0.944782 Cos Low=-0.944381 Cos Med=-0.944779 Cos High=-0.944782
X=10.0833 Sin=-0.611975 Sin Low=-0.611673 Sin Med=-0.611973 Sin High=-0.611975
X=10.0833 Cos=-0.790877 Cos Low=-0.790488 Cos Med=-0.790875 Cos High=-0.790877



Answer 3:


It seems to me that the sine1, sine2, sine3 and sine4 arrays are completely independent of each other. So you are basically running a single for loop over 4 different arrays that have no dependencies between them.

Spawn 4 threads, one per array, so that the 4 for loops run at the same time. On a multicore machine this should speed up your function dramatically: in the ideal case close to 4x, less the overhead of creating and synchronizing the threads.




Answer 4:


Actually, combining threads (consider OpenMP for this) with a lookup table for sin is a good idea. If possible, use float instead of double. Depending on the platform, you could also use SIMD instructions, though the latter would make the use of threads unnecessary.

Cheers



Source: https://stackoverflow.com/questions/20850291/c-improving-performance-of-function-with-heavy-sin-usage
