Clang optimizing using SVML and its autovectorization

Submitted by 左心房为你撑大大i on 2019-12-11 00:58:14

Question


Consider this simple function:

#include <math.h>
void ahoj(float *a)
{
    for (int i=0; i<256; i++) a[i] = sin(a[i]);
}
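Note that although the source calls the double-precision sin on a float, the generated code calls __svml_sinf8, the 8-lane single-precision SVML routine; -ffast-math lets Clang shrink the float-to-double-to-float round trip into a single-precision call. A minor variant (my own, not from the question) that does not depend on that transformation would call sinf directly:

#include <math.h>
void ahoj_f(float *a)
{
    /* sinf keeps the computation explicitly in single precision,
       so vectorizing to an 8-wide float routine does not rely on
       -ffast-math libcall shrinking (hypothetical variant, untested). */
    for (int i = 0; i < 256; i++) a[i] = sinf(a[i]);
}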

Try it at https://godbolt.org/z/ynQKRb with the following settings:

-fveclib=SVML -mfpmath=sse -ffast-math -fno-math-errno -O3 -mavx2 -fvectorize

Select x86-64 clang 7.0, currently the newest version. This is the most interesting part of the result:

vmovups ymm0, ymmword ptr [rdi]
vmovups ymm1, ymmword ptr [rdi + 32]
vmovups ymmword ptr [rsp], ymm1 # 32-byte Spill
vmovups ymm1, ymmword ptr [rdi + 64]
vmovups ymmword ptr [rsp + 32], ymm1 # 32-byte Spill
vmovups ymm1, ymmword ptr [rdi + 96]
vmovups ymmword ptr [rsp + 96], ymm1 # 32-byte Spill
call    __svml_sinf8
vmovups ymmword ptr [rsp + 64], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp] # 32-byte Reload
call    __svml_sinf8
vmovups ymmword ptr [rsp], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp + 32] # 32-byte Reload
call    __svml_sinf8
vmovups ymmword ptr [rsp + 32], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp + 96] # 32-byte Reload
call    __svml_sinf8
vmovups ymm1, ymmword ptr [rsp + 64] # 32-byte Reload
vmovups ymmword ptr [rbx], ymm1
vmovups ymm1, ymmword ptr [rsp] # 32-byte Reload
vmovups ymmword ptr [rbx + 32], ymm1
vmovups ymm1, ymmword ptr [rsp + 32] # 32-byte Reload
vmovups ymmword ptr [rbx + 64], ymm1
vmovups ymmword ptr [rbx + 96], ymm0
vmovups ymm0, ymmword ptr [rbx + 128]
vmovups ymm1, ymmword ptr [rbx + 160]
vmovups ymmword ptr [rsp], ymm1 # 32-byte Spill
vmovups ymm1, ymmword ptr [rbx + 192]
vmovups ymmword ptr [rsp + 32], ymm1 # 32-byte Spill
vmovups ymm1, ymmword ptr [rbx + 224]
vmovups ymmword ptr [rsp + 96], ymm1 # 32-byte Spill
call    __svml_sinf8
vmovups ymmword ptr [rsp + 64], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp] # 32-byte Reload
call    __svml_sinf8
vmovups ymmword ptr [rsp], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp + 32] # 32-byte Reload
call    __svml_sinf8
vmovups ymmword ptr [rsp + 32], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp + 96] # 32-byte Reload
call    __svml_sinf8
...

It avoids loops entirely and instead emits straight-line code that processes all 256 items. Can this really be the optimal solution, considering the code cache? With -mavx512f it even expands 1024 items :).
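The replicated call sequences come from the vectorizer interleaving (unrolling) the vectorized loop on top of the known trip count. If the code-size cost matters, one thing worth trying is to constrain the interleave count and unrolling with Clang's loop pragmas so a real loop around the __svml_sinf8 call is kept. The pragmas themselves are documented Clang extensions, but whether Clang 7 honours them on the SVML path here is an assumption, not something I have verified:

#include <math.h>
void ahoj(float *a)
{
    /* Ask for 8-wide vectorization (one YMM of floats) but no
       interleaving and no full unrolling, so the compiler should keep
       a loop with a single SVML call per iteration. Untested sketch. */
    #pragma clang loop vectorize_width(8) interleave_count(1) unroll(disable)
    for (int i = 0; i < 256; i++) a[i] = sin(a[i]);
}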

Another problem is that with this option, current Clang sometimes generates AVX-512 code even when the target is AVX2, which makes it basically unusable.

Source: https://stackoverflow.com/questions/52562272/clang-optimizing-using-svml-and-its-autovectorization
