I asked a question about reducing branch mispredictions.
Jerry Coffin gave me an impressive answer.
About reducing the branch misprediction
The binary se
Because that version is doing a ton of loads and stores.
Branch prediction in a tight loop like that often has no effect because the processor has multiple pipelines. As the branch test is being evaluated, both code paths are already being decoded and evaluated. Only the results of one path are kept - but there is usually no pipeline stall from a branch.
Writing to memory, on the other hand, can have an effect. Usually you are writing to a cache on the CPU, but the cache hardware then has to keep those cache lines in sync with the rest of the system. If the array is large and you are accessing it in essentially random order, you get constant cache misses, and the CPU has to keep reloading cache lines from memory.
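
To make the access-order point concrete, here is a minimal sketch that performs the same stores into an array twice, once in index order and once in a shuffled order. It is not the binary search from the question; the names `make_order`, `write_in_order` and `time_writes_ms` are made up for illustration, and the index vector itself is read sequentially in both cases, so only the stores differ in locality.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Build the index order used for the writes: 0, 1, 2, ... or a shuffled copy.
// (Hypothetical helper, not part of the original question or answer.)
std::vector<std::size_t> make_order(std::size_t n, bool shuffled) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    if (shuffled) {
        std::mt19937 rng(12345);  // fixed seed, so the shuffle is repeatable
        std::shuffle(order.begin(), order.end(), rng);
    }
    return order;
}

// Store one value per slot, visiting the slots in the given order.
void write_in_order(std::vector<std::uint32_t>& data,
                    const std::vector<std::size_t>& order) {
    for (std::size_t i : order) {
        data[i] = static_cast<std::uint32_t>(i) + 1u;
    }
}

// Time a single pass of write_in_order, in milliseconds.
double time_writes_ms(std::vector<std::uint32_t>& data,
                      const std::vector<std::size_t>& order) {
    auto start = std::chrono::steady_clock::now();
    write_in_order(data, order);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To try it, call `time_writes_ms` with `make_order(n, false)` and then with `make_order(n, true)` on an array much larger than the last-level cache. I would expect the shuffled pass to be noticeably slower on most machines, but the size of the gap depends on the hardware, the compiler flags, and the array size, so treat any numbers as specific to your setup.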