Adding to the already posted answers, I'd like to mention cache behaviour. A particular memory access pattern can be so much slower because of repeated cache misses that a theoretically slower algorithm with a more cache-friendly access pattern performs much better in practice.
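
As a rough illustration (a minimal sketch, not a benchmark), the two functions below sum the same n x n matrix, which is assumed to be stored in row-major order in a flat array. They do exactly the same number of additions; only the traversal order differs. The names `sum_row_major` and `sum_col_major` are made up for this example. This is a simpler case than comparing two different algorithms, but it isolates the effect of the access pattern.

```c
#include <stddef.h>

/* Sum an n x n matrix stored in row-major order, row by row.
 * Consecutive iterations touch adjacent memory (stride of 1 element),
 * so each cache line that is loaded gets fully used. */
double sum_row_major(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            total += a[i * n + j];
    return total;
}

/* Sum the same matrix column by column.
 * Consecutive iterations are n elements apart in memory, so for a
 * large n almost every access can land on a different cache line. */
double sum_col_major(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            total += a[i * n + j];
    return total;
}
```

Both functions have the same asymptotic cost, yet for a matrix that does not fit in cache the column-wise version is typically noticeably slower. How much slower depends on the hardware, the matrix size, and what the compiler does with the loops, so it is worth timing both on your own machine.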