Sometimes a loop where the CPU spends most of its time contains a branch that is mispredicted very often (with probability near 0.5). I've seen a few techniques on very isol
In my opinion if you're reaching down to this level of optimization, it's probably time to drop right into assembly language.
Essentially you're counting on the compiler generating a specific pattern of assembly to take advantage of this optimization in C anyway. It's difficult to guess exactly what code a compiler is going to generate, so you'd have to re-inspect the generated assembly every time a small change is made. Why not just do it in assembly and be done with it?
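
To make the "counting on the compiler" point concrete, here is a minimal sketch in C. The function names and the threshold-sum task are my own illustration, not something from the question. Both functions compute the same result; the second is written in the hope of getting branch-free code, but nothing in the C standard promises that. A compiler may turn the first into a conditional move or the second back into a branch, so you would still have to check the output (for example with `gcc -O2 -S`) after every change.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: the `if` depends on the data, so on random input
   it goes each way about half the time. */
int64_t sum_above_branchy(const int32_t *data, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}

/* Branch-free formulation: turn the comparison result (0 or 1) into a
   mask of all zeros or all ones, then AND it with the value.
   Whether this actually compiles without a branch is up to the compiler. */
int64_t sum_above_branchless(const int32_t *data, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t mask = -(int64_t)(data[i] >= threshold); /* -1 or 0 */
        sum += (int64_t)data[i] & mask;
    }
    return sum;
}
```

The two versions are interchangeable from the caller's point of view, which is exactly the problem: the source tells you nothing about which one ends up with a conditional jump in the hot loop.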