Are loads of variables that are aligned on word boundaries faster than unaligned load operations on x86/64 (Intel/AMD 64 bit) processors?
A colleague of mine argues
A Random Guy On The Internet I've found says that on the 486 an aligned 32-bit access takes one cycle. An unaligned 32-bit access that spans quads but stays within the same cache line takes four cycles. An unaligned access that spans multiple cache lines can take an extra six to twelve cycles.
Given that an unaligned access requires accessing multiple quads of memory, pretty much by definition, I'm not at all surprised by this. I'd imagine that better caching performance on modern processors makes the cost a little less bad, but it's still something to be avoided.
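To make the "spans multiple quads" point concrete, here's a rough sketch of how you could check whether a given access straddles a boundary. The function name `crosses_boundary` is my own invention, and the boundary sizes you pass in (4 bytes for a quad, 64 bytes for a cache line on many current x86 parts) are assumptions you'd need to confirm for your own hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if an access of `size` bytes starting at
   address `addr` touches more than one `boundary`-sized block, 0 otherwise.
   `boundary` must be a power of two (e.g. 4 for a 32-bit quad, or the
   cache line size of the target CPU, which is an assumption here). */
int crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(size > 0);
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

    uintptr_t first_block = addr / boundary;              /* block holding the first byte */
    uintptr_t last_block  = (addr + size - 1) / boundary; /* block holding the last byte  */
    return first_block != last_block;
}
```

A 4-byte load at address 0x1001 crosses a 4-byte boundary but not a 64-byte one, while the same load at 0x103E crosses both, which would be the expensive case described above.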
(Incidentally, if your code has any pretensions to portability... ia32 and descendants are pretty much the only modern architectures that support unaligned accesses at all. ARM, for example, can vary between throwing an exception, emulating the access in software, or just loading the wrong value, depending on the OS!)
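If you do need to read a value from a possibly unaligned address in portable C, the usual approach is to copy the bytes with `memcpy` rather than cast and dereference the pointer. This is only a minimal sketch; the names `load_u32` and `is_aligned` are mine, not from any library. Compilers targeting x86 will typically turn the `memcpy` into a single load, but that's an optimisation I'm assuming, not something the C standard promises.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `p` is a multiple of `alignment`,
   which must be a power of two. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: reads a 32-bit value in host byte order from any
   address, aligned or not. Copying byte-wise avoids the undefined
   behaviour of dereferencing a misaligned uint32_t pointer. */
uint32_t load_u32(const void *p)
{
    uint32_t value;
    memcpy(&value, p, sizeof value);
    return value;
}
```

Code that goes through `load_u32` works whether or not the target tolerates unaligned accesses, and `is_aligned` lets you take a faster path when the pointer happens to be aligned.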
Update: Here's someone who actually went and measured it. On his hardware he reckons unaligned access to be half as fast as aligned. Go try it for yourself...