I'm developing optimizations for my 3D calculations and I now have:
a "plain" version using the standard C language libraries,
Of course it's possible.
The best way to do it is to have functions that each do the complete job, and select among them at runtime. The following would work, but it is not optimal:
    #include <stddef.h>  /* size_t */
    #include <stdio.h>   /* fprintf */
    #include <stdlib.h>  /* exit */

    typedef enum
    {
        calc_type_invalid = 0,
        calc_type_plain,
        calc_type_sse,
        calc_type_avx,
        calc_type_max // not a valid value
    } calc_type;

    void do_my_calculation(float const *input, float *output, size_t len, calc_type ct)
    {
        float f;
        size_t i;
        for (i = 0; i < len; ++i)
        {
            // the switch is re-evaluated on every iteration
            switch (ct)
            {
            case calc_type_plain:
                // plain calculation here
                break;
            case calc_type_sse:
                // SSE calculation here
                break;
            case calc_type_avx:
                // AVX calculation here
                break;
            default:
                fprintf(stderr, "internal error, unexpected calc_type %d\n", (int)ct);
                exit(1);
                break;
            }
        }
    }
On each pass through the loop, the code executes a switch statement, which is pure overhead. A really clever compiler could theoretically hoist the switch out of the loop for you (an optimization known as loop unswitching), but it is better to fix it yourself.
Instead, write three separate functions, one for plain, one for SSE, and one for AVX. Then decide at runtime which one to run.
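A minimal sketch of that idea, under some loud assumptions: the names `calc_fn`, `calc_plain`, `calc_sse`, `calc_avx`, and `select_calc` are hypothetical, the "calculation" is a stand-in (doubling each element), and the SSE and AVX bodies are placeholders that just call the plain version. The CPU check uses the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, which are only available on x86 with those compilers; on anything else the sketch falls back to the plain function.

```c
#include <stddef.h>

/* Hypothetical signature shared by all three implementations. */
typedef void (*calc_fn)(float const *input, float *output, size_t len);

/* Plain version. The doubling is only a stand-in for the real calculation. */
void calc_plain(float const *input, float *output, size_t len)
{
    size_t i;
    for (i = 0; i < len; ++i)
        output[i] = input[i] * 2.0f;
}

/* Placeholder: a real version would use SSE intrinsics here. */
void calc_sse(float const *input, float *output, size_t len)
{
    calc_plain(input, output, len);
}

/* Placeholder: a real version would use AVX intrinsics here. */
void calc_avx(float const *input, float *output, size_t len)
{
    calc_plain(input, output, len);
}

/* Decide once, at runtime, which implementation to use. */
calc_fn select_calc(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* GCC/Clang-specific CPU feature detection (an assumption of this sketch). */
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return calc_avx;
    if (__builtin_cpu_supports("sse"))
        return calc_sse;
#endif
    return calc_plain;
}
```

The caller would do something like `calc_fn calc = select_calc();` once at startup and then call `calc(input, output, len)` wherever the work is needed, so the decision is made once rather than once per loop iteration. Note that real SSE or AVX bodies generally need to be compiled with the matching instruction set enabled (for example per source file, or per function with a target attribute in GCC/Clang), while the dispatcher itself is compiled for the baseline CPU.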
For bonus points, in a "debug" build, do the calculation with both the SSE version and the plain version, and assert that the results are close enough to give confidence. Write the plain version, not for speed, but for correctness; then use its results to verify that your clever optimized versions get the correct answer.
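The debug-build check could look roughly like this. It is only a sketch: `results_close` and `run_checked` are hypothetical helper names, the comparison uses a simple absolute tolerance (a relative tolerance may suit your data better), and the check is compiled out when `NDEBUG` is defined, which is the usual convention for `assert`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical signature shared by the plain and accelerated versions. */
typedef void (*calc_fn)(float const *input, float *output, size_t len);

/* Returns 1 if every pair of elements differs by at most tol, else 0. */
int results_close(float const *a, float const *b, size_t len, float tol)
{
    size_t i;
    for (i = 0; i < len; ++i)
    {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        if (!(d <= tol)) /* written this way so a NaN difference also fails */
            return 0;
    }
    return 1;
}

/* Runs the fast version; in debug builds, also runs the reference
   version into a scratch buffer and asserts that the two agree. */
void run_checked(calc_fn fast, calc_fn reference,
                 float const *input, float *output, size_t len, float tol)
{
    fast(input, output, len);
#ifndef NDEBUG
    {
        float *expected = malloc(len * sizeof *expected);
        if (expected != NULL) /* skip the check if no scratch memory */
        {
            reference(input, expected, len);
            assert(results_close(expected, output, len, tol));
            free(expected);
        }
    }
#else
    (void)reference;
    (void)tol;
#endif
}
```

Exact equality is usually the wrong test here, because a vectorized version may round or order its operations differently from the plain one; that is why the sketch takes a tolerance instead.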
The legendary John Carmack recommends the latter approach; he calls it "parallel implementations". Read his essay about it.
So I recommend you write the plain version first. Then, go back and start re-writing parts of your application using SSE or AVX acceleration, and make sure that the accelerated versions give the correct answers. (And sometimes, the plain version might have a bug that the accelerated version doesn't. Having two versions and comparing them helps make bugs come to light in either version.)