Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x64_64 Intel CPUs?
I'm considering changing some code high performance code that currently requires 16 byte aligned arrays and uses _mm_load_ps to relax the alignment constraint and use _mm_loadu_ps . There are a lot of myths about the performance implications of memory alignment for SSE instructions, so I made a small test case of what should be a memory-bandwidth bound loop. Using either the aligned or unaligned load intrinsic, it runs 100 iterations through a large array, summing the elements with SSE intrinsics. The source code is here. https://gist.github.com/rmcgibbo/7689820 The results on a 64 bit Macbook