How do optimizing compilers decide when and how much to unroll a loop?

问题

When a compiler performs a loop-unroll optimization, how does it determined by which factor to unroll the loop or whether to unroll the whole loop? Since this is a space-performance trade-off, on average how effictive is this optimization technique in making the program perform better? Also, under what conditions is it recommended to use this technique (i.e certain operations or calculations)?

This doesn't have to be specific to a certain compiler. It can be any explanation outlining the idea behind this technique and what has been observed in practice.

回答1:

When a compiler performs a loop unroll optimization, how does it determined by which factor to unroll the loop or weather to unroll the whole loop or not.

stack consumption and locality. instruction counts. ability to make/propagate optimizations based on the unrolled and inlined program. whether the loop size is fixed, or expected to be in a certain range. profile inputs (if applicable). operations which may be removed from the loop body. etc.

Since this is a space-performance tradeoff on average how effictive is this optimization technique in making the program perform better?

it depends largely on the input (your program). it can be slower (not typical) or it can be several times faster. writing a program to run optimally and which also enables the optimizer to do its job is learned.

Also, under what conditions is it recommended to use this technique (i.e certain operations or calculations)

generally, a large number of iterations on very small bodies, particularly that which is branchless and has good data locality.

if you want to know if the option helps your app, profile.

if you need more than that, you should reserve some time to learn how to write optimal programs, since the subject is quite complex.

回答2:

The simplistic analysis is to count instructions - a 2 instruction loop unrolled 10 times has 11 instructions instead of 20 yields a 11/20 speedup. But with modern processor architectures it's much more complex; depending on cache sizes and the characteristics of the processors instruction pipeline. It's possible that the above example would run 10x faster instead of 2x. It's also possible that unrolling 1000x instead of 10x would run slower. Without targeting a specific processor, compilers (or pragmas you write for them) are just guessing.

回答3:

when it is (in my opinion) good to unroll a loop:

loop is short and possibly all variables used are in processor register. After unrolling variables are 'duplicated' but still are in registers so no memory(or cache) penalty.

loop (with unknown loop unrool number) will be executed at least few or dozen times, so there's justification for loading that whole loop unrolled to instruction cache.

if loop is short (one or just few intructions) it might be very beneficial for unrolling because code for determining if it should be executed again is executed less often.

回答4:

Ok, first of all, i don't know how the compilers does this automatically. And i'm pretty sure there are at least 10s if not 100s of algorithms that compilers have to choose from.
And it's probably compiler-specific anyway.

But, I can help you with calculating its effectiveness.

Just note that this technique usually doesn't give you a great performance boost.
But in repeated looped calculations and can give high percentage performance.
This is because usually the function inside the loop takes much more computation time than the loop's condition checking.

So, lets say we have a simple loop with a constant, cause you were too lazy to do copy-paste or just thought it would look better:

for (int i = 0; i < 5; i++)
{
    DoSomething();
}

Here you have 5 int comparisons, 5 incrementations, and 5 DoSomethig() calls.
So if DoSomething() is relatively quick, then we got 15 operations.
Now if you'll unroll this, you'll reduce it to just 5 operations:

DoSomething();
DoSomething();
DoSomething();
DoSomething();
DoSomething();

Now with constants it's easier, so lets see how it would work with a variable:

for (int i = 0; i < n; i++)
{
    DoSomething();
}

Here you have n int comparisons, n incrementations, and n DoSomethig() calls = 3n . Now, we can't unroll it entirely, but we could unroll it by a constant factor (the higher n is expected to be, the more we should unroll it):

int i;
for (i = 0; i < n; i = i+3)
{
    DoSomething();
    DoSomething();
    DoSomething();
}
if (i - n == 2)
{
    DoSomething(); // We passed n by to, so there's one more left
}
else if (i - n == 1)
{
    DoSomething();  //We passed n by only 1, so there's two more left
    DoSomething();
}

Now here we have Here you have n/3+2 int comparisons, n/3 incrementations, and n DoSomethig() calls = (1 2/3)*n.
We saved ourselves (1 1/3)*n operations. Which cuts the computation time almost in half.

FYI, another neat unrolling technique is called Duff's device.
But it's very compiler and language-implementation specific. There are languages where this would actually be worse.

来源：https://stackoverflow.com/questions/7691215/how-do-optimizing-compilers-decide-when-and-how-much-to-unroll-a-loop

标签

c++

performance

compiler-optimization

loop-unrolling