Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x86_64 Intel CPUs?


礼貌的吻别 2021-02-02 12:51

I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps to relax the alignment constraint and

4 answers

误落风尘 (original poster)

2021-02-02 13:21

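This is architecture-dependent, and recent CPU generations have improved things significantly. On the older Core2 architecture, on the other hand, _mm_loadu_ps costs about the same as _mm_load_ps when the memory happens to be aligned, but is roughly 2.5x slower when the memory is actually unaligned: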

$ gcc -O3 -fno-inline foo2.c -o a; ./a 1000000 
Array Size: 3.815 MB                    
Trial 1
_mm_load_ps with aligned memory:    0.003983
_mm_loadu_ps with aligned memory:   0.003889
_mm_loadu_ps with unaligned memory: 0.008085
Trial 2
_mm_load_ps with aligned memory:    0.002553
_mm_loadu_ps with aligned memory:   0.002567
_mm_loadu_ps with unaligned memory: 0.006444
Trial 3
_mm_load_ps with aligned memory:    0.002557
_mm_loadu_ps with aligned memory:   0.002552
_mm_loadu_ps with unaligned memory: 0.006430
Trial 4
_mm_load_ps with aligned memory:    0.002563
_mm_loadu_ps with aligned memory:   0.002568
_mm_loadu_ps with unaligned memory: 0.006436
Trial 5
_mm_load_ps with aligned memory:    0.002543
_mm_loadu_ps with aligned memory:   0.002565
_mm_loadu_ps with unaligned memory: 0.006400
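
The source of foo2.c is not shown above, so the following is only a minimal sketch of how such a comparison might be set up, not the code that produced these numbers. The names sum_aligned, sum_unaligned and time_sum are hypothetical helpers made up for illustration; only the intrinsics themselves (_mm_load_ps, _mm_loadu_ps, _mm_add_ps, and so on) are real.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <xmmintrin.h>  /* SSE intrinsics, plus _mm_malloc/_mm_free on GCC/Clang */

/* Hypothetical helper: sum n floats with aligned loads.
   p must be 16-byte aligned and n a multiple of 4. */
static float sum_aligned(const float *p, size_t n)
{
    assert(((uintptr_t)p & 15) == 0);  /* _mm_load_ps faults on unaligned addresses */
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Hypothetical helper: same loop, but with unaligned loads.
   p may have any alignment; n must still be a multiple of 4. */
static float sum_unaligned(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Hypothetical helper: time one call of fn in seconds of CPU time.
   The result is written to *out so the compiler cannot discard the work. */
static double time_sum(float (*fn)(const float *, size_t),
                       const float *p, size_t n, float *out)
{
    clock_t t0 = clock();
    *out = fn(p, n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

To reproduce the three cases in the output, one could allocate n + 4 floats with _mm_malloc(..., 16), then time sum_aligned on buf, sum_unaligned on buf, and sum_unaligned on buf + 1 (which is 4 bytes off a 16-byte boundary). Run several trials and discard the first, as the Trial 1 numbers above suggest warm-up effects.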
