Question
A developer can use the __builtin_expect builtin to help the compiler understand which direction a branch is likely to go. In the future we may get a standard attribute for this purpose, but as of today at least clang, icc, and gcc all support the non-standard __builtin_expect instead.
However, icc seems to generate oddly terrible code when you use it1. That is, code that uses the builtin is strictly worse than the code without it, regardless of which direction the prediction is made.
Take for example the following toy function:
int foo(int a, int b)
{
    do {
        a *= 77;
    } while (b-- > 0);
    return a * 77;
}
Out of the three compilers, icc
is the only one that compiles this to the optimal scalar loop of 3 instructions:
foo(int, int):
..B1.2: # Preds ..B1.2 ..B1.1
imul edi, edi, 77 #4.6
dec esi #5.12
jns ..B1.2 # Prob 82% #5.18
imul eax, edi, 77 #6.14
ret
Both gcc and clang manage to miss the easy solution and use 5 instructions.
On the other hand, when you use likely or unlikely macros on the loop condition, icc goes totally braindead:
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
int foo(int a, int b)
{
    do {
        a *= 77;
    } while (likely(b-- > 0));
    return a * 77;
}
This loop is functionally equivalent to the previous loop (since __builtin_expect
just returns its first argument), yet icc produces some awful code:
foo(int, int):
mov eax, 1 #9.12
..B1.2: # Preds ..B1.2 ..B1.1
xor edx, edx #9.12
test esi, esi #9.12
cmovg edx, eax #9.12
dec esi #9.12
imul edi, edi, 77 #8.6
test edx, edx #9.12
jne ..B1.2 # Prob 95% #9.12
imul eax, edi, 77 #11.15
ret #11.15
The function has doubled in size to 10 instructions, and (worse yet!) the critical loop has more than doubled to 7 instructions with a long critical dependency chain involving a cmov and other weird stuff.
The same is true if you use the unlikely hint and also across all icc versions (13, 14, 17) that godbolt supports. So the code generation is strictly worse, regardless of the hint, and regardless of the actual runtime behavior.
Neither gcc nor clang suffers any degradation when hints are used.
What's up with that?
1 At least in the first and subsequent examples I tried.
Answer 1:
To me it seems an ICC bug. This code (available on godbolt)
int c;
do
{
    a *= 77;
    c = b--;
}
while (likely(c > 0));
that simply uses an auxiliary local var c, produces an output without the edx = !!(esi > 0) pattern:
foo(int, int):
..B1.2:
mov eax, esi
dec esi
imul edi, edi, 77
test eax, eax
jg ..B1.2
still not optimal (it could do without eax), though.
I don't know if the official ICC policy about __builtin_expect is full support or just compatibility support.
This question seems better suited for the Official ICC forum.
I've tried posting this topic there but I'm not sure I've made a good job (I've been spoiled by SO).
If they answer me I'll update this answer.
EDIT
I've got an answer at the Intel Forum; they recorded this issue in their tracking system.
As of today, it seems to be a bug.
Answer 2:
Don't let the instructions deceive you. What matters is performance.
Consider this rather crude test:
#include "stdafx.h"
#include <windows.h>
#include <iostream>
int foo(int a, int b) {
    do { a *= 7; } while (b-- > 0);
    return a * 7;
}

int fooA(int a, int b) {
    __asm {
        mov esi, b
        mov edi, a
        mov eax, a
    B1:
        imul edi, edi, 7
        dec esi
        jns B1
        imul eax, edi, 7
    }
}

int fooB(int a, int b) {
    __asm {
        mov esi, b
        mov edi, a
        mov eax, 1
    B1:
        xor edx, edx
        test esi, esi
        cmovg edx, eax
        dec esi
        imul edi, edi, 7
        test edx, edx
        jne B1
        imul eax, edi, 7
    }
}

int main() {
    DWORD start = GetTickCount();
    int j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += foo(aa, bb);
        }
    }
    std::cout << "foo compiled (/Od)\n" << "j = " << j << "\n"
              << GetTickCount() - start << "ms\n\n";
    start = GetTickCount();
    j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += fooA(aa, bb);
        }
    }
    std::cout << "optimal scalar\n" << "j = " << j << "\n"
              << GetTickCount() - start << "ms\n\n";
    start = GetTickCount();
    j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += fooB(aa, bb);
        }
    }
    std::cout << "use likely\n" << "j = " << j << "\n"
              << GetTickCount() - start << "ms\n\n";
    std::cin.get();
    return 0;
}
produces output:
foo compiled (/Od)
j = -961623752
4422ms

optimal scalar
j = -961623752
1656ms

use likely
j = -961623752
1641ms
This is naturally entirely CPU dependent (tested here on Haswell i7), but both asm loops generally are very nearly identical in performance when tested over a range of inputs. A lot of this has to do with the selection and ordering of instructions being conducive to leveraging instruction pipelining (latency), branch prediction, and other hardware optimizations in the CPU.
The real lesson when you're optimizing is that you need to profile - it's extremely difficult to do this by inspection of the raw assembly.
Even giving a challenging test where likely(b-- > 0) isn't true over a third of the time:
for (int aa = -10000000; aa < 10000000; aa++) {
    for (int bb = -3; bb < 9; bb++) {
        j += fooX(aa, bb);
    }
}
results in :
foo compiled (/Od) : 1844ms
optimal scalar : 906ms
use likely : 1187ms
Which isn't bad. What you have to keep in mind is that the compiler will generally do its best without your interference. Using __builtin_expect and the like should really be restricted to cases where you have existing code that you have profiled and that you have specifically identified as being both hotspots and as having pipeline or prediction issues. This trivial example is an ideal case where the compiler will almost certainly do the right thing without help from you.
By including __builtin_expect you're asking the compiler to necessarily compile in a different way - a more complex way, in terms of pure number of instructions, but a more intelligent way in that it structures the assembly in a way that helps the CPU make better branch predictions. In this case of pure register play (as in this example) there's not much at stake, but if it improves prediction in a more complex loop, maybe saving you a bad misprediction, cache misses, and related collateral damage, then it's probably worth using.
I think it's pretty clear here, at least, that when the branch actually is likely then we very nearly recover the full performance of the optimal loop (which I think is impressive). In cases where the "optimal loop" is rather more complex and less trivial we can expect that the codegen would indeed improve branch prediction rates (which is what this is really all about). I think this is really a case of if you don't need it, don't use it.
On the topic of likely vs unlikely generating the same assembly, this doesn't imply that the compiler is broken - it just means that the same codegen is effective regardless of whether the branch is mostly taken or mostly not taken - as long as it is mostly something, it's good (in this case). The codegen is designed to optimise use of the instruction pipeline and to assist branch prediction, which it does. While we saw some reduction in performance with the mixed case above, pushing the loop to mostly unlikely recovers performance.
for (int aa = -10000000; aa < 10000000; aa++) {
    for (int bb = -30; bb < 1; bb++) {
        j += fooX(aa, bb);
    }
}
foo compiled (/Od) : 2453ms
optimal scalar : 1968ms
use likely : 2094ms
Source: https://stackoverflow.com/questions/41731642/why-does-icc-fail-to-handle-compile-time-branch-hints-in-a-reasonable-way