Aligned and unaligned memory access with AVX/AVX2 intrinsics

柔情痞子 submitted on 2019-12-03 12:24:48
Z boson

There is no way to explicitly control folding of loads with intrinsics. I consider this a weakness of intrinsics. If you want to explicitly control the folding then you have to use assembly.

In previous versions of GCC I was able to control the folding to some degree by choosing an aligned or an unaligned load. However, that no longer appears to be the case (GCC 4.9.2). For example, in the function AddDot4x4_vec_block_8wide here the loads are folded:

vmulps  ymm9, ymm0, YMMWORD PTR [rax-256]
vaddps  ymm8, ymm9, ymm8

However, in a previous version of GCC the loads were not folded:

vmovups ymm9, YMMWORD PTR [rax-256]
vmulps  ymm9, ymm0, ymm9
vaddps  ymm8, ymm8, ymm9

The correct solution is, obviously, to use aligned loads only when you know the data is aligned and, if you really want to control the folding explicitly, to use assembly.
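As a minimal sketch of that advice (not code from the question), the two functions below do the same multiply-add on eight floats. The names mul_add8_aligned, mul_add8_unaligned and is_aligned32 are hypothetical, and the target("avx") attribute is a GCC/Clang extension used here only so the file compiles without -mavx. The aligned variant checks its precondition with assert; the unaligned variant accepts any address. In both cases the compiler remains free to fold a load into the arithmetic instruction or to keep it separate.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p sits on a 32-byte boundary. */
static int is_aligned32(const void *p) {
    return ((uintptr_t)p & 31u) == 0;
}

/* acc[0..7] += a[0..7] * b[0..7].
 * Precondition: all three pointers are 32-byte aligned. */
__attribute__((target("avx")))
static void mul_add8_aligned(const float *a, const float *b, float *acc) {
    assert(is_aligned32(a) && is_aligned32(b) && is_aligned32(acc));
    __m256 va = _mm256_load_ps(a);      /* aligned load: faults if misaligned */
    __m256 vb = _mm256_load_ps(b);
    __m256 vacc = _mm256_load_ps(acc);
    vacc = _mm256_add_ps(vacc, _mm256_mul_ps(va, vb));
    _mm256_store_ps(acc, vacc);         /* aligned store */
}

/* Same computation, but with no alignment requirement on the pointers. */
__attribute__((target("avx")))
static void mul_add8_unaligned(const float *a, const float *b, float *acc) {
    __m256 va = _mm256_loadu_ps(a);     /* unaligned load: any address */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vacc = _mm256_loadu_ps(acc);
    vacc = _mm256_add_ps(vacc, _mm256_mul_ps(va, vb));
    _mm256_storeu_ps(acc, vacc);        /* unaligned store */
}
```

Calling mul_add8_unaligned is always safe; mul_add8_aligned is only appropriate when the caller can guarantee the alignment, for example for static or heap buffers that were allocated on a 32-byte boundary.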

In addition to Z boson's answer, I can add that the compiler is rightfully folding the loads because it assumes the memory region is aligned (the array is marked with __attribute__ ((aligned(32)))). At run time, however, that attribute does not work for values on the stack because the stack is only 16-byte aligned (see this bug). You can try forcing the compiler to realign the stack to 32 bytes on entry to main by specifying -mstackrealign and -mpreferred-stack-boundary=5 (see here), but this incurs a performance overhead.
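One alternative that neither answer mentions, offered here only as an assumption-laden sketch, is to keep the array off the stack and allocate it on a 32-byte boundary with C11 aligned_alloc, so the alignment does not depend on how the stack is set up. The helper names alloc_floats_32 and is_aligned_32 are hypothetical, and the sketch does not guard against overflow of n * sizeof(float).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for n floats on a 32-byte boundary.
 * Returns NULL on failure; release the buffer with free(). */
static float *alloc_floats_32(size_t n) {
    size_t bytes = n * sizeof(float);
    /* C11 asks that the size be a multiple of the alignment, so round up. */
    size_t rounded = (bytes + 31u) & ~(size_t)31u;
    if (rounded == 0) {
        rounded = 32;  /* avoid a zero-size request */
    }
    return aligned_alloc(32, rounded);
}

/* Hypothetical helper: nonzero if p sits on a 32-byte boundary. */
static int is_aligned_32(const void *p) {
    return ((uintptr_t)p % 32u) == 0;
}
```

A buffer obtained this way can be checked with is_aligned_32 before it is handed to code that uses aligned loads such as _mm256_load_ps.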
