I was curious if java.lang.Integer.rotateLeft gets optimized by using a rotation instruction and wrote a benchmark for it. The results were inconclusive: It was
According to this benchmark, shifts and rotates have the same latency on your CPU, but rotates have lower throughput (the results listed there as "T" are reciprocal throughput, which is more easily comparable with latencies). That could produce precisely the kind of result you're seeing: the lower throughput gets in the way a little, but you weren't completely saturating the execution units, so the full factor of 2 difference doesn't show. Testing that yourself is not easy, especially if you have to fight a compiler to make it emit the instructions you want.
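If you want to poke at the latency vs. throughput distinction yourself, here is a rough sketch of how I would set it up. The class and method names (RotateSketch, latencyChain, throughputChains) are made up for illustration, and the hand-rolled timing is an assumption-laden shortcut: a harness like JMH would be far more trustworthy, and nothing here guarantees the JIT actually emits a rotate instruction. The first loop is one dependent chain, so it is bound by latency; the second runs four independent chains, so it has a chance of being bound by throughput instead.

```java
final class RotateSketch {
    // Library call. HotSpot may compile this to a rotate instruction,
    // but that is up to the JIT and is not guaranteed.
    static int rotateLibrary(int x, int n) {
        return Integer.rotateLeft(x, n);
    }

    // The classic shift-or idiom. Java masks int shift distances to their
    // low 5 bits, so this is well defined for any n, including 0.
    static int rotateShiftOr(int x, int n) {
        return (x << n) | (x >>> -n);
    }

    // One dependent chain: every rotate needs the previous result,
    // so the loop speed is limited by latency.
    static int latencyChain(int seed, int iterations) {
        int x = seed;
        for (int i = 0; i < iterations; i++) {
            x = Integer.rotateLeft(x, 7) ^ i;
        }
        return x;
    }

    // Four independent chains: the CPU can overlap them, so the loop
    // speed is more likely to be limited by throughput.
    static int throughputChains(int seed, int iterations) {
        int a = seed, b = seed + 1, c = seed + 2, d = seed + 3;
        for (int i = 0; i < iterations; i++) {
            a = Integer.rotateLeft(a, 7) ^ i;
            b = Integer.rotateLeft(b, 7) ^ i;
            c = Integer.rotateLeft(c, 7) ^ i;
            d = Integer.rotateLeft(d, 7) ^ i;
        }
        return a ^ b ^ c ^ d;
    }

    public static void main(String[] args) {
        final int n = 100_000_000;
        int sink = 0;

        // Warm up so the JIT has a chance to compile both loops.
        for (int i = 0; i < 5; i++) {
            sink ^= latencyChain(i, n / 10);
            sink ^= throughputChains(i, n / 10);
        }

        long t0 = System.nanoTime();
        sink ^= latencyChain(1, n);
        long t1 = System.nanoTime();
        sink ^= throughputChains(1, n);
        long t2 = System.nanoTime();

        // Printing sink keeps the results live so the loops are not
        // removed as dead code.
        System.out.println("dependent chain:   " + (t1 - t0) / 1e6 + " ms for " + n + " rotates");
        System.out.println("independent chains: " + (t2 - t1) / 1e6 + " ms for " + 4L * n + " rotates");
        System.out.println("sink = " + sink);
    }
}
```

If the time per rotate in the second loop is clearly lower than in the first, you are seeing the gap between latency and reciprocal throughput. Swapping Integer.rotateLeft for rotateShiftOr (or for a plain shift) in both loops gives you the comparison, though the XOR with i and the loop overhead are also in the measurement, so treat the numbers as a hint and check the generated assembly if you really want to know.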