Question
I noticed that if we know there is a good chance a condition will be true or false, we can tell the compiler. For instance, the Linux kernel uses likely/unlikely in many places, which are actually implemented with GCC's __builtin_expect. I wanted to find out how it works, so I checked the assembly:
20:branch_prediction_victim.cpp **** if (array_aka[j] >= 128)
184 .loc 3 20 0 is_stmt 1
185 00f1 488B85D0 movq -131120(%rbp), %rax
185 FFFDFF
186 00f8 8B8485F0 movl -131088(%rbp,%rax,4), %eax
186 FFFDFF
187 00ff 83F87F cmpl $127, %eax
188 0102 7E17 jle .L13
Then with __builtin_expect:
20:branch_prediction_victim.cpp **** if (__builtin_expect((array_aka[j] >= 128), 1))
184 .loc 3 20 0 is_stmt 1
185 00f1 488B85D0 movq -131120(%rbp), %rax
185 FFFDFF
186 00f8 8B8485F0 movl -131088(%rbp,%rax,4), %eax
186 FFFDFF
187 00ff 83F87F cmpl $127, %eax
188 0102 0F9FC0 setg %al
189 0105 0FB6C0 movzbl %al, %eax
190 0108 4885C0 testq %rax, %rax
191 010b 7417 je .L13
188 setg: set if greater, but set if greater than what here?
189 movzbl: move with zero-extend from byte to long; I know this one moves %al to %eax
190 testq: bitwise OR, then set the ZF and CF flags, is this right?
I want to know how these affect branch prediction and improve performance. Three extra instructions means more cycles are needed, right?
Answer 1:
setcc reads FLAGS, in this case set by the cmp right before. Read the manual.
This looks like you forgot to enable optimization, so __builtin_expect is just creating a 0 / 1 boolean value in a register and branching on it being non-zero, instead of branching on the original FLAGS condition. Don't look at un-optimized code, it's always going to suck.
The clues are the braindead booleanizing as part of likely, and loading j from the stack (using RBP as a frame pointer) with movq -131120(%rbp), %rax.
likely generally doesn't improve runtime branch prediction, it improves code layout to minimize the amount of taken branches when things go the way the source code said they would (i.e. the fast case). So it improves I-cache locality for the common case. e.g. the compiler will lay things out so the common case is a not-taken conditional branch, just falling through. This makes things easier for the front-end in superscalar pipelined CPUs that fetch/decode multiple instructions at once. Continuing to fetch in a straight line is easiest.
likely can actually get the compiler to use a branch instead of a cmov for cases that you know are predictable, even if compiler heuristics (without profile-guided optimization) would have gotten it wrong. Related: gcc optimization flag -O3 makes code slower than -O2
Source: https://stackoverflow.com/questions/61030543/how-to-understand-macro-likely-affecting-branch-prediction