Have different optimizations (plain, SSE, AVX) in the same executable with C/C++

生来不讨喜 2021-01-02 16:46

I'm developing optimizations for my 3D calculations and I now have:

  • a "plain" version using the standard C language libraries,
  • an
3 answers
  •  鱼传尺愫
    2021-01-02 17:31

    Of course it's possible.

    The best way to do it is to have functions that each do the complete job, and to select among them at runtime. By contrast, the following version would work but is not optimal:

    #include <stddef.h> // size_t
    #include <stdio.h>  // fprintf
    #include <stdlib.h> // exit

    typedef enum
    {
        calc_type_invalid = 0,
        calc_type_plain,
        calc_type_sse,
        calc_type_avx,
        calc_type_max // not a valid value
    } calc_type;
    
    void do_my_calculation(float const *input, float *output, size_t len, calc_type ct)
    {
        float f;
        size_t i;
    
        for (i = 0; i < len; ++i)
        {
            switch (ct)
            {
                case calc_type_plain:
                    // plain calculation here
                    break;
                case calc_type_sse:
                    // SSE calculation here
                    break;
                case calc_type_avx:
                    // AVX calculation here
                    break;
                default:
                    fprintf(stderr, "internal error, unexpected calc_type %d\n", (int)ct);
                    exit(1);
                    break;
            }
        }
    }
    

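    On each pass through the loop, the code executes a switch statement on a value that never changes inside the loop, which is pure overhead. A really clever compiler could theoretically hoist the switch out of the loop for you (an optimization known as loop unswitching), but it is better to fix it yourself.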

    Instead, write three separate functions, one for plain, one for SSE, and one for AVX. Then decide at runtime which one to run.
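
    One way to sketch the "three functions, chosen at runtime" idea is a function pointer that is resolved once. Everything here is an assumption made for illustration: the names calc_fn, calc_plain, calc_sse, calc_avx and select_calc are hypothetical, the "calculation" just doubles each element, and the SSE and AVX bodies are placeholders that forward to the plain version. CPU detection uses the GCC/Clang builtin __builtin_cpu_supports, which is only available on x86 targets; other compilers would need their own check (for example a cpuid query).

```c
#include <stddef.h>

/* Hypothetical signature shared by all three implementations. */
typedef void (*calc_fn)(float const *input, float *output, size_t len);

/* Reference version: a stand-in calculation that doubles each element. */
static void calc_plain(float const *input, float *output, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        output[i] = input[i] * 2.0f;
}

/* Placeholders: real versions would use SSE / AVX intrinsics and live in
 * files compiled with the matching flags. They forward to the plain code
 * here so that the sketch builds on any machine. */
static void calc_sse(float const *input, float *output, size_t len)
{
    calc_plain(input, output, len);
}

static void calc_avx(float const *input, float *output, size_t len)
{
    calc_plain(input, output, len);
}

/* Pick the best implementation the running CPU supports. */
static calc_fn select_calc(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return calc_avx;
    if (__builtin_cpu_supports("sse"))
        return calc_sse;
#endif
    return calc_plain; /* safe fallback everywhere else */
}

/* The decision is made once, on first use, not once per element. */
void do_my_calculation(float const *input, float *output, size_t len)
{
    static calc_fn chosen = NULL;

    if (chosen == NULL)
        chosen = select_calc();
    chosen(input, output, len);
}
```

    Callers keep using do_my_calculation(input, output, len) and never see which version ran. The lazy initialization above is not thread-safe; a multithreaded program would probably call select_calc once at startup instead.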

    For bonus points, in a "debug" build, do the calculation with both the SSE version and the plain version, and assert that the results are close enough to give confidence. Write the plain version not for speed but for correctness; then use its results to verify that your clever optimized versions get the correct answer.
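
    The debug-build cross-check could look roughly like the sketch below. It is only an illustration under stated assumptions: calc_plain, calc_fast and nearly_equal are hypothetical names, calc_fast is ordinary C standing in for an SSE or AVX routine, the 1e-5 tolerance is an arbitrary choice that a real application would tune, and input and output are assumed not to overlap. The check disappears when NDEBUG is defined, as is usual for release builds.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: written for clarity, not speed. */
static void calc_plain(float const *input, float *output, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        output[i] = input[i] * 0.5f + 1.0f;
}

/* Stand-in for an optimized version: same math, arranged differently. */
static void calc_fast(float const *input, float *output, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        output[i] = (input[i] + 2.0f) * 0.5f;
}

/* Returns 1 when a and b agree within rel_tol, measured relative to the
 * larger magnitude (with a floor of 1.0 so values near zero still pass). */
static int nearly_equal(float a, float b, float rel_tol)
{
    float diff = (a > b) ? a - b : b - a;
    float mag_a = (a < 0.0f) ? -a : a;
    float mag_b = (b < 0.0f) ? -b : b;
    float scale = (mag_a > mag_b) ? mag_a : mag_b;

    if (scale < 1.0f)
        scale = 1.0f;
    return diff <= rel_tol * scale;
}

void do_my_calculation(float const *input, float *output, size_t len)
{
    calc_fast(input, output, len);

#ifndef NDEBUG
    /* Debug builds only: recompute with the reference version in small
     * chunks and compare element by element. */
    enum { CHUNK = 256 };
    float expected[CHUNK];

    for (size_t base = 0; base < len; base += CHUNK) {
        size_t n = (len - base < CHUNK) ? len - base : CHUNK;

        calc_plain(input + base, expected, n);
        for (size_t i = 0; i < n; ++i)
            assert(nearly_equal(output[base + i], expected[i], 1e-5f));
    }
#endif
}
```

    Comparing with a tolerance rather than with == matters because vectorized code often reorders floating-point operations, so bit-identical results should not be expected in general.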

    The legendary John Carmack recommends the latter approach; he calls it "parallel implementations". Read his essay about it.

    So I recommend you write the plain version first. Then, go back and start re-writing parts of your application using SSE or AVX acceleration, and make sure that the accelerated versions give the correct answers. (And sometimes, the plain version might have a bug that the accelerated version doesn't. Having two versions and comparing them helps make bugs come to light in either version.)
