I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps
to relax the alignment constraint and
This is architecture-dependent, and recent CPU generations have improved things significantly. On the older Core2 architecture, on the other hand, unaligned loads from unaligned memory are still clearly slower:
$ gcc -O3 -fno-inline foo2.c -o a; ./a 1000000
Array Size: 3.815 MB
Trial 1
_mm_load_ps with aligned memory: 0.003983
_mm_loadu_ps with aligned memory: 0.003889
_mm_loadu_ps with unaligned memory: 0.008085
Trial 2
_mm_load_ps with aligned memory: 0.002553
_mm_loadu_ps with aligned memory: 0.002567
_mm_loadu_ps with unaligned memory: 0.006444
Trial 3
_mm_load_ps with aligned memory: 0.002557
_mm_loadu_ps with aligned memory: 0.002552
_mm_loadu_ps with unaligned memory: 0.006430
Trial 4
_mm_load_ps with aligned memory: 0.002563
_mm_loadu_ps with aligned memory: 0.002568
_mm_loadu_ps with unaligned memory: 0.006436
Trial 5
_mm_load_ps with aligned memory: 0.002543
_mm_loadu_ps with aligned memory: 0.002565
_mm_loadu_ps with unaligned memory: 0.006400
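
The source of foo2.c is not shown here, so the following is only a minimal sketch of the kind of kernels such a benchmark might time, not the actual test program. The function names `sum_load` and `sum_loadu` are hypothetical, and the sketch assumes the element count is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps, _mm_add_ps */

/* Hypothetical kernel: sum n floats using aligned loads.
   p must be 16-byte aligned and n a multiple of 4;
   _mm_load_ps may fault on a misaligned address. */
float sum_load(const float *p, size_t n)
{
    assert(((uintptr_t)p & 15) == 0);  /* enforce the alignment precondition */
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);         /* spill the 4 partial sums */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Same kernel with unaligned loads: p may have any alignment,
   n must still be a multiple of 4. */
float sum_loadu(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

To reproduce the three cases in the output above, one would presumably time `sum_load(buf, n)`, `sum_loadu(buf, n)`, and `sum_loadu(buf + 1, n)` on a 16-byte-aligned buffer `buf` that holds at least `n + 1` floats. The last call shifts the start by one float, so every load is misaligned.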