In Java, I have two different statements which accomplish the same result through using ternary operators, which are as follows:
num < 0 ? 0 : num;
First, let's rewrite the benchmark with JMH to avoid common benchmarking pitfalls.
public class FloatCompare {
@Benchmark
public float cmp() {
float num = ThreadLocalRandom.current().nextFloat() * 2 - 1;
return num < 0 ? 0 : num;
}
@Benchmark
public float mul() {
float num = ThreadLocalRandom.current().nextFloat() * 2 - 1;
return num * (num < 0 ? 0 : 1);
}
}
JMH also suggests that the multiplication code is a way faster:
Benchmark Mode Cnt Score Error Units
FloatCompare.cmp avgt 5 12,940 ± 0,166 ns/op
FloatCompare.mul avgt 5 6,182 ± 0,101 ns/op
Now it's time to engage perfasm profiler (built into JMH) to see the assembly produced by JIT compiler. Here are the most important parts of the output (comments are mine):
cmp method:
5,65% │││ 0x0000000002e717d0: vxorps xmm1,xmm1,xmm1 ; xmm1 := 0
0,28% │││ 0x0000000002e717d4: vucomiss xmm1,xmm0 ; compare num < 0 ?
4,25% │╰│ 0x0000000002e717d8: jbe 2e71720h ; jump if num >= 0
9,77% │ ╰ 0x0000000002e717de: jmp 2e71711h ; jump if num < 0
mul method:
1,59% ││ 0x000000000321f90c: vxorps xmm1,xmm1,xmm1 ; xmm1 := 0
3,80% ││ 0x000000000321f910: mov r11d,1h ; r11d := 1
││ 0x000000000321f916: xor r8d,r8d ; r8d := 0
││ 0x000000000321f919: vucomiss xmm1,xmm0 ; compare num < 0 ?
2,23% ││ 0x000000000321f91d: cmovnbe r11d,r8d ; r11d := r8d if num < 0
5,06% ││ 0x000000000321f921: vcvtsi2ss xmm1,xmm1,r11d ; xmm1 := (float) r11d
7,04% ││ 0x000000000321f926: vmulss xmm0,xmm1,xmm0 ; multiply
The key difference is that there's no jump instructions in the mul method. Instead, conditional move instruction cmovnbe is used.
cmov works with integer registers. Since (num < 0 ? 0 : 1) expression uses integer constants on the right side, JIT is smart enough to emit a conditional move instead of a conditional jump.
In this benchmark, conditional jump is very inefficient, since branch prediction often fails due to random nature of numbers. That's why the branchless code of mul method appears faster.
If we modify the benchmark in a way that one branch prevails over another, e.g by replacing
ThreadLocalRandom.current().nextFloat() * 2 - 1
with
ThreadLocalRandom.current().nextFloat() * 2 - 0.1f
then the branch prediction will work better, and cmp method will become as fast as mul:
Benchmark Mode Cnt Score Error Units
FloatCompare.cmp avgt 5 5,793 ± 0,045 ns/op
FloatCompare.mul avgt 5 5,764 ± 0,048 ns/op