Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x86_64 Intel CPUs?

礼貌的吻别 2021-02-02 12:51

I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps to relax the alignment constraint and

4 answers
  •  误落风尘
    2021-02-02 13:21

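    This is architecture-dependent, and recent CPU generations have improved things significantly. On the older Core2 architecture, on the other hand, _mm_loadu_ps costs about the same as _mm_load_ps when the data happens to be aligned, but is roughly 2.5 times slower on genuinely unaligned data: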

    $ gcc -O3 -fno-inline foo2.c -o a; ./a 1000000 
    Array Size: 3.815 MB                    
    Trial 1
    _mm_load_ps with aligned memory:    0.003983
    _mm_loadu_ps with aligned memory:   0.003889
    _mm_loadu_ps with unaligned memory: 0.008085
    Trial 2
    _mm_load_ps with aligned memory:    0.002553
    _mm_loadu_ps with aligned memory:   0.002567
    _mm_loadu_ps with unaligned memory: 0.006444
    Trial 3
    _mm_load_ps with aligned memory:    0.002557
    _mm_loadu_ps with aligned memory:   0.002552
    _mm_loadu_ps with unaligned memory: 0.006430
    Trial 4
    _mm_load_ps with aligned memory:    0.002563
    _mm_loadu_ps with aligned memory:   0.002568
    _mm_loadu_ps with unaligned memory: 0.006436
    Trial 5
    _mm_load_ps with aligned memory:    0.002543
    _mm_loadu_ps with aligned memory:   0.002565
    _mm_loadu_ps with unaligned memory: 0.006400
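
The source of foo2.c is not shown in the answer, so the following is only a minimal sketch of how the three cases could be set up, not the author's actual benchmark. All function names here (make_ones, horizontal_sum, sum_aligned, sum_unaligned, time_sum) are hypothetical, and the workload (a plain vector sum) is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <xmmintrin.h> /* SSE intrinsics; also provides _mm_malloc/_mm_free */

/* Hypothetical helper: allocate n + 4 floats on a 16-byte boundary and
   fill them with 1.0f. The 4 extra floats let callers read n floats
   starting at buf + 1, which is deliberately misaligned by 4 bytes. */
float *make_ones(size_t n) {
    float *buf = (float *)_mm_malloc((n + 4) * sizeof(float), 16);
    if (buf != NULL)
        for (size_t i = 0; i < n + 4; i++)
            buf[i] = 1.0f;
    return buf;
}

/* Add up the four lanes of an SSE accumulator. */
float horizontal_sum(__m128 v) {
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Sum n floats (n must be a multiple of 4) using aligned loads.
   p must be 16-byte aligned, otherwise _mm_load_ps may fault. */
float sum_aligned(const float *p, size_t n) {
    assert(((uintptr_t)p & 15) == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));
    return horizontal_sum(acc);
}

/* Same loop using unaligned loads; p may have any alignment. */
float sum_unaligned(const float *p, size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    return horizontal_sum(acc);
}

/* CPU time in seconds for one call of fn. The sum is written to *out
   so that the result of the timed call is visible to the caller. */
double time_sum(float (*fn)(const float *, size_t),
                const float *p, size_t n, float *out) {
    clock_t start = clock();
    *out = fn(p, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

With buf = make_ones(n), the three rows of each trial would correspond to time_sum(sum_aligned, buf, n, &r), time_sum(sum_unaligned, buf, n, &r), and time_sum(sum_unaligned, buf + 1, n, &r). An n of 1000000 floats occupies about 3.815 MB, which matches the array size reported above. Building with -fno-inline, as in the answer's command line, keeps the compiler from merging the loops into the timing code.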
    
