Force compiler to not optimize side-effect-less statements

Submitted by 吃可爱长大的小学妹 on 2020-01-21 06:30:31

Question


I was reading some old game programming books and as some of you might know, back in that day it was usually faster to do bit hacks than do things the standard way. (Converting float to int, mask sign bit, convert back for absolute value, instead of just calling fabs(), for example)

Nowadays it is almost always better to just use the standard library math functions, since these tiny tricks are hardly the cause of most bottlenecks anyway.

But I still want to do a comparison, just for curiosity's sake. So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements that have no side effect, such as:

void float_to_int(float f)
{
    int i = static_cast<int>(f); // has no side-effects
}

Is there a way to do this? As far as I can tell, doing something like i += 10 will still have no side-effect and as such won't solve the problem.

The only thing I can think of is having a global variable, int dummy;, and after the cast doing something like dummy += i, so the value of i is used. But I feel like this dummy operation will get in the way of the results I want.

I'm using Visual Studio 2008 / G++ (3.4.4).

Edit

To clarify: I would like all optimizations maxed out, to get realistic profile results. The problem is that, with optimizations on, statements with no side effects get optimized away entirely, hence the question.

Edit Again

To clarify once more, read this: I'm not trying to micro-optimize this in some sort of production code.

We all know that the old tricks aren't very useful anymore; I'm merely curious how not-useful they are. Just plain curiosity. Sure, life could go on without me knowing how these old hacks perform against modern-day CPUs, but it never hurts to know.

So telling me "these tricks aren't useful anymore, stop trying to micro-optimize blah blah" is an answer completely missing the point. I know they aren't useful, I don't use them.

Premature quoting of Knuth is the root of all annoyance.


Answer 1:


Compilers are unfortunately allowed to optimise as much as they like, even without any explicit switches, so long as the code behaves as if no optimisation took place. However, you can often trick them into keeping a computation by indicating that the value might be used later, so I would change your code to:

int float_to_int(float f)
{
    return static_cast<int>(f); // the result now escapes via the return value
}

As others have suggested, you will need to examine the assembler output to check that this approach actually works.




Answer 2:


Assignment to a volatile variable should never be optimized away, so this might give you the result you want:

static volatile int i = 0;

void float_to_int(float f)
{
    i = static_cast<int>(f); // the volatile write cannot be elided
}



Answer 3:


So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements

You are by definition skewing the results.

Here's how to fix the problem of trying to profile "dummy" code that you wrote just to test: For profiling, save your results to a global/static array and print one member of the array to the output at the end of the program. The compiler will not be able to optimize out any of the computations that placed values in the array, but you'll still get any other optimizations it can put in to make the code fast.




Answer 4:


In this case I suggest you make the function return the integer value:

int float_to_int(float f)
{
   return static_cast<int>(f);
}

Your calling code can then exercise it with a printf to guarantee it won't optimize it out. Also make sure float_to_int is in a separate compilation unit so the compiler can't play any tricks.

extern int float_to_int(float f);
int sum = 0;
// start timing here
for (int i = 0; i < 1000000; i++)
{
   sum += float_to_int(1.0f);
}
// end timing here
printf("sum=%d\n", sum);

Now compare this to an empty function like:

int take_float_return_int(float /* f */)
{
   return 1;
}

This should also be defined in a separate compilation unit.

The difference in times should give you an idea of the expense of what you're trying to measure.




Answer 5:


What always worked on all compilers I used so far:

extern volatile int writeMe = 0;

void float_to_int(float f)
{    
  writeMe = static_cast<int>(f); 
}

Note that this skews the results slightly: both methods being compared should write to writeMe.

volatile tells the compiler "the value may be accessed without your notice", so the compiler cannot omit the calculation and drop the result. To block propagation of input constants, you might need to run them through an extern volatile, too:

extern volatile float readMe = 0;
extern volatile int writeMe = 0;

void float_to_int(float f)
{    
  writeMe = static_cast<int>(f); 
}

int main()
{
  readMe = 17;
  float_to_int(readMe);
}

Still, all optimizations in between the read and the write can be applied "with full force". The read and write to the global variable also make good "fenceposts" when inspecting the generated assembly.

Without the extern, the compiler may notice that no reference to the variable is ever taken and conclude it can drop the volatile. Technically, with Link Time Code Generation even that might not be enough, but I haven't found a compiler that aggressive. (For a compiler that does remove the access, the reference would need to be passed to a function in a DLL loaded at runtime.)




Answer 6:


You can skip straight to the part where you learn something: read Intel's published CPU optimisation manuals.

They state quite clearly that casting between float and int is a really bad idea, because it requires a store from the integer register to memory followed by a load into a floating-point register. These operations cause a bubble in the pipeline and waste many precious cycles.




Answer 7:


A function call incurs quite a bit of overhead, so I would remove this anyway.

Adding a dummy += i; is no problem, as long as you keep that same bit of code in the alternate profile too (i.e., in the code you are comparing it against).

Last but not least: generate the asm code. Even if you cannot write asm, the generated code is typically understandable, since it will have labels and the commented C code interleaved. So you know (sort of) what happens, and which bits are kept.

R

p.s. found this too:

inline float pslNegFabs32f(float x)
{
    __asm {
        fld  x   // push x onto st(0) of the FPU stack
        fabs     // take the absolute value
        fchs     // change sign: result is -|x|
        fstp x   // pop st(0) back into x
    }
    return x;
}

Supposedly this is also very fast; you might want to profile it too (although it is hardly portable code).




Answer 8:


Return the value?

int float_to_int(float f)
{
    return static_cast<int>(f); // the result now escapes via the return value
}

and then at the call site, you can sum all the return values up, and print out the result when the benchmark is done. The usual way to do this is to somehow make sure you depend on the result.

You could use a global variable instead, but it seems like that'd generate more cache misses. Usually, simply returning the value to the caller (and making sure the caller actually does something with it) does the trick.




Answer 9:


A micro-benchmark around this statement will not be representative of using this approach in a genuine scenario; the surrounding instructions and their effect on the pipeline and cache are generally as important as any given statement in itself.




Answer 10:


GCC 4 performs a lot of micro-optimizations that GCC 3.4 never did. GCC 4 includes a tree vectorizer that turns out to do a very good job of taking advantage of SSE and MMX. It also uses the GMP and MPFR libraries to assist in optimizing calls to things like sin() and fabs(), as well as optimizing such calls down to their FPU, SSE, or 3DNow! equivalents.

I know the Intel compiler is also extremely good at these kinds of optimizations.

My suggestion is to not worry about micro-optimizations like this - on relatively new hardware (anything built in the last 5 or 6 years), they're almost completely moot.

Edit: On recent CPUs, the FPU's fabs instruction is far faster than a cast to int and bit mask, and the fsin instruction is generally going to be faster than precalculating a table or extrapolating a Taylor series. A lot of the optimizations you would find in, for example, "Tricks of the Game Programming Gurus," are completely moot, and as pointed out in another answer, could potentially be slower than instructions on the FPU and in SSE.

All of this is because newer CPUs are pipelined: instructions are decoded and dispatched to fast computation units. Instructions no longer have a fixed cost in clock cycles, and performance is more sensitive to cache misses and inter-instruction dependencies.

Check the AMD and Intel processor programming manuals for all the gritty details.



Source: https://stackoverflow.com/questions/1152333/force-compiler-to-not-optimize-side-effect-less-statements
