问题
I've reduced my code down to the following, which is as simple as I could make it whilst retaining the compiler output that interests me.
void foo(const uint64_t used)
{
uint64_t ar[100];
for(int i = 0; i < 100; ++i)
{
ar[i] = some_global_array[i];
}
const uint64_t mask = ar[0];
if((used & mask) != 0)
{
return;
}
bar(ar); // Not inlined
}
Using VC10 with /O2 and /Ob1, the generated assembly pretty much reflects the order of instructions in the above C++ code. Since the local array ar
is only passed to bar()
when the condition fails, and is otherwise unused, I would have expected the compiler to optimize to something like the following.
if((used & some_global_array[0]) != 0)
{
return;
}
// Now do the copying to ar and call bar(ar)...
Is the compiler not doing this because it's simply too hard for it to identify such optimizations in the general case? Or is it following some strict rule that forbids it from doing so? If so, why, and is there some way I can give it a hint that doing so wouldn't change the semantics of my program?
Note: obviously it would be trivial to obtain the optimized output by just rearranging the code, but I'm interested in why the compiler won't optimize in such cases, not how to do so in this (intentionally simplified) case.
回答1:
There are no "strict rules" controlling what kind of assembly language the compiler is permitted to output. If the compiler can be certain that a block of code does not need to be executed (because it has no side effects) due to some precondition, then it is absolutely permitted to short-circuit the whole thing.
This sort of optimisation can be fairly complex in the general case, and your compiler might not go to all that effort. If this is performance critical code, then you can fine-tune your source code (as you suggest) to help the compiler generate the best assembly code. This is a trial-and-error process though, and you might have to do it again for the next version of the compiler.
回答2:
Probably the reason why this is not getting optimized is the global array. The compiler can't know beforehand if, say, accessing some_global_array[99]
will result in some kind of exception/signal being generated so it has to execute the whole loop. Things would be pretty different if the global array was statically defined in the same compilation unit.
For example, in LLVM, the following three definitions of the global array will yield wildly differing outputs of that function:
// this yields pretty much what you're seeing
uint64_t *some_global_array;
// this calls memcpy and then performs the conditional check
uint64_t some_global_array[100] = {0};
// this calls memset (not memcpy!) on the ar array and then bar directly (no
// conditional checks since the array is const and filled with 0s, so the if
// is always false)
const uint64_t some_global_array[100] = {0};
The second is pretty puzzling, but it may simply be a missed optimization (or maybe I'm missing something else).
来源:https://stackoverflow.com/questions/9441882/compiler-instruction-reordering-optimizations-in-c-and-what-inhibits-them