I found that different compiler optimization levels in gcc give quite different results when accessing a local or a global variable in a loop. The reason this surprised me i
Global variable = global memory, and subject to aliasing (read as: bad for the optimizer, which in the worst case must do a read-modify-write on every access).
Local variable = register (unless the compiler really can't help it; sometimes it must put it on the stack too, but the stack is practically guaranteed to be in L1).
Accessing a register is on the order of a single cycle; accessing memory is on the order of 15-1000 cycles (depending on whether the cache line is in cache and not invalidated by another core, and on whether the page is in the TLB).
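
To make the global-versus-local distinction concrete, here is a minimal sketch of the two kinds of loop. The names `g_sum`, `sum_into_global` and `sum_into_local` are made up for illustration and are not from the code I measured. The point is only that `data` may legally point at `g_sum`, so the compiler has to assume each store to the global could change what the next load sees, whereas the local accumulator cannot be aliased because its address is never taken.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */

/* Global accumulator: it has external linkage and lives in memory. */
int g_sum;

/* Accumulates directly into the global. Since data has type
   "pointer to int", it may alias g_sum, so the compiler generally
   cannot keep g_sum in a register for the whole loop. */
void sum_into_global(const int *data, size_t n)
{
    g_sum = 0;
    for (size_t i = 0; i < n; ++i)
        g_sum += data[i];
}

/* Accumulates into a local whose address is never taken. Nothing can
   alias it, so the compiler is free to keep it in a register and
   hand back the result once at the end. */
int sum_into_local(const int *data, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}
```

Compiling this with something like `gcc -O2 -S` and comparing the two loop bodies should show whether the global version writes back to memory on every iteration. The exact output depends on the gcc version, the target and the flags, so treat it as something to check rather than a guaranteed result.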