On a mailing list I\'m subscribed to, two fairly knowledgeable (IMO) programmers were discussing some optimized code, and saying something along the lines of:
Yes. but with a caveat. The idea that looping backwards is faster never applied to all older CPUs. It's an x86 thing (as in 8086 through 486, possibly Pentium, although I don't think any further).
That optimization never applied to any other CPU architecture that I know of.
Here's why.
The 8086 had a register that was specifically optimized for use as a loop counter. You put your loop count in CX, and then there are several instructions that decrement CX and then set condition codes if it goes to zero. In fact there was an instruction prefix you could put before other instructions (the REP prefix) that would basically iterate the other instruction until CX got to 0.
Back in the days when we counted instructions and instructions had known fixed cycle counts using cx as your loop counter was the way to go, and cx was optimized for counting down.
But that was a long time ago. Ever since the Pentium, those complex instructions have been slower overall than using more, and simpler instructions. (RISC baby!) The key thing we try to do these days is try to put some time between loading a register and using it because the pipelines can actually do multiple things per cycle as long as you don't try to use the same register for more than one thing at a time.
Nowdays the thing that kills performance isn't the comparison, it's the branching, and then only when the branch prediction predicts wrong.