Question
A developer can use the __builtin_expect builtin to help the compiler understand which direction a branch is likely to go. In the future we may get a standard attribute for this purpose, but as of today at least clang, icc, and gcc all support the non-standard __builtin_expect instead.
However, icc seems to generate oddly terrible code when you use it1. That is, code that uses the builtin is strictly worse than the code without it, regardless of which direction the prediction is made.
Take for example the following toy function:
int foo(int a, int b)
{
    do {
        a *= 77;
    } while (b-- > 0);
    return a * 77;
}
Out of the three compilers, icc
is the only one that compiles this to the optimal scalar loop of 3 instructions:
foo(int, int):
..B1.2: # Preds ..B1.2 ..B1.1
imul edi, edi, 77 #4.6
dec esi #5.12
jns ..B1.2 # Prob 82% #5.18
imul eax, edi, 77 #6.14
ret
Both gcc and clang manage to miss the easy solution and use 5 instructions.
On the other hand, when you use likely or unlikely macros on the loop condition, icc goes totally braindead:
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
int foo(int a, int b)
{
    do {
        a *= 77;
    } while (likely(b-- > 0));
    return a * 77;
}
This loop is functionally equivalent to the previous loop (since __builtin_expect
just returns its first argument), yet icc produces some awful code:
foo(int, int):
mov eax, 1 #9.12
..B1.2: # Preds ..B1.2 ..B1.1
xor edx, edx #9.12
test esi, esi #9.12
cmovg edx, eax #9.12
dec esi #9.12
imul edi, edi, 77 #8.6
test edx, edx #9.12
jne ..B1.2 # Prob 95% #9.12
imul eax, edi, 77 #11.15
ret #11.15
The function has doubled in size to 10 instructions, and (worse yet!) the critical loop has more than doubled to 7 instructions with a long critical dependency chain involving a cmov and other weird stuff.
The same is true if you use the unlikely hint and also across all icc versions (13, 14, 17) that godbolt supports. So the code generation is strictly worse, regardless of the hint, and regardless of the actual runtime behavior.
Neither gcc nor clang suffers any degradation when hints are used.
What's up with that?
1 At least in the first and subsequent examples I tried.
Answer 1:
To me it seems an ICC bug. This code (available on godbolt)
int c;
do
{
    a *= 77;
    c = b--;
}
while (likely(c > 0));
that simply uses an auxiliary local var c, produces an output without the edx = !!(esi > 0) pattern:
foo(int, int):
..B1.2:
mov eax, esi
dec esi
imul edi, edi, 77
test eax, eax
jg ..B1.2
still not optimal (it could do without eax), though.
I don't know if the official ICC policy about __builtin_expect is full support or just compatibility support.
This question seems better suited for the Official ICC forum.
I've tried posting this topic there but I'm not sure I've made a good job (I've been spoiled by SO).
If they answer me I'll update this answer.
EDIT
I've got an answer at the Intel Forum; they recorded this issue in their tracking system.
As of today, it seems to be a bug.
Answer 2:
Don't let the instructions deceive you. What matters is performance.
Consider this rather crude test:
#include "stdafx.h"
#include <windows.h>
#include <iostream>
int foo(int a, int b) {
    do { a *= 7; } while (b-- > 0);
    return a * 7;
}

int fooA(int a, int b) {
    __asm {
        mov esi, b
        mov edi, a
        mov eax, a
    B1:
        imul edi, edi, 7
        dec esi
        jns B1
        imul eax, edi, 7
    }
}

int fooB(int a, int b) {
    __asm {
        mov esi, b
        mov edi, a
        mov eax, 1
    B1:
        xor edx, edx
        test esi, esi
        cmovg edx, eax
        dec esi
        imul edi, edi, 7
        test edx, edx
        jne B1
        imul eax, edi, 7
    }
}

int main() {
    DWORD start = GetTickCount();
    int j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += foo(aa, bb);
        }
    }
    std::cout << "foo compiled (/Od)\n" << "j = " << j << "\n"
              << GetTickCount() - start << "ms\n\n";
    start = GetTickCount();
    j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += fooA(aa, bb);
        }
    }
    std::cout << "optimal scalar\n" << "j = " << j << "\n"
              << GetTickCount() - start << "ms\n\n";
    start = GetTickCount();
    j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += fooB(aa, bb);
        }
    }
    std::cout << "use likely\n" << "j = " << j << "\n"
              << GetTickCount() - start << "ms\n\n";
    std::cin.get();
    return 0;
}
produces output:
foo compiled (/Od)
j = -961623752
4422ms

optimal scalar
j = -961623752
1656ms

use likely
j = -961623752
1641ms
This is naturally entirely CPU dependent (tested here on Haswell i7), but both asm loops generally are very nearly identical in performance when tested over a range of inputs. A lot of this has to do with the selection and ordering of instructions being conducive to leveraging instruction pipelining (latency), branch prediction, and other hardware optimizations in the CPU.
The real lesson when you're optimizing is that you need to profile - it's extremely difficult to do this by inspection of the raw assembly.
Even giving a challenging test where likely(b-- > 0) isn't true over a third of the time:
for (int aa = -10000000; aa < 10000000; aa++) {
    for (int bb = -3; bb < 9; bb++) {
        j += fooX(aa, bb);
    }
}
results in :
foo compiled (/Od) : 1844ms
optimal scalar : 906ms
use likely : 1187ms
Which isn't bad. What you have to keep in mind is that the compiler will generally do its best without your interference. Using __builtin_expect and the like should really be restricted to cases where you have existing code that you have profiled and that you have specifically identified as being both hotspots and as having pipeline or prediction issues. This trivial example is an ideal case where the compiler will almost certainly do the right thing without help from you.
By including __builtin_expect you're asking the compiler to necessarily compile in a different way - a more complex way, in terms of pure number of instructions, but a more intelligent way in that it structures the assembly in a way that helps the CPU make better branch predictions. In this case of pure register play (as in this example) there's not much at stake, but if it improves prediction in a more complex loop, maybe saving you a bad misprediction, cache misses, and related collateral damage, then it's probably worth using.
I think it's pretty clear here, at least, that when the branch actually is likely then we very nearly recover the full performance of the optimal loop (which I think is impressive). In cases where the "optimal loop" is rather more complex and less trivial we can expect that the codegen would indeed improve branch prediction rates (which is what this is really all about). I think this is really a case of if you don't need it, don't use it.
On the topic of likely vs unlikely generating the same assembly, this doesn't imply that the compiler is broken - it just means that the same codegen is effective regardless of whether the branch is mostly taken or mostly not taken - as long as it is mostly something, it's good (in this case). The codegen is designed to optimise use of the instruction pipeline and to assist branch prediction, which it does. While we saw some reduction in performance with the mixed case above, pushing the loop to mostly unlikely recovers performance.
for (int aa = -10000000; aa < 10000000; aa++) {
    for (int bb = -30; bb < 1; bb++) {
        j += fooX(aa, bb);
    }
}
foo compiled (/Od) : 2453ms
optimal scalar : 1968ms
use likely : 2094ms
Source: https://stackoverflow.com/questions/41731642/why-does-icc-fail-to-handle-compile-time-branch-hints-in-a-reasonable-way