SSE-copy, AVX-copy and std::copy performance

庸人自扰 2020-12-08 03:28

I tried to improve the performance of a copy operation using SSE and AVX:

    #include 

    const int sz = 1024;
    float *mas = (float *)_mm_         


        
5 Answers
  •  情书的邮戳
    2020-12-08 04:04

    Writing fast SSE code is not as simple as swapping SSE operations in for their scalar equivalents. In this case I suspect your compiler cannot usefully unroll the load/store pair, so your time is dominated by stalls caused by using the output of one low-throughput operation (the load) in the very next instruction (the store).

    You can test this idea by manually unrolling one notch:

    //SSE-copy testing
    start2 = std::chrono::system_clock::now();
    for(int i=0; i

    Normally when using intrinsics I disassemble the output and make sure nothing crazy is going on (you could try this to verify if/how the original loop got unrolled). For more complex loops the right tool to use is the Intel Architecture Code Analyzer (IACA). It's a static analysis tool which can tell you things like "you have pipeline stalls".
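
As a rough illustration of the manual unrolling suggested in the answer above, here is a minimal sketch of an SSE copy loop unrolled by two. The function name `sse_copy_unrolled2` and its signature are hypothetical and do not come from the original post. It uses the unaligned `_mm_loadu_ps`/`_mm_storeu_ps` so it works on any buffer; if the buffers are known to be 16-byte aligned, the aligned `_mm_load_ps`/`_mm_store_ps` could be used instead.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_storeu_ps

// Hypothetical helper (not from the original post): copies n floats from
// src to dst with the SSE load/store loop unrolled by two.
void sse_copy_unrolled2(const float* src, float* dst, std::size_t n) {
    std::size_t i = 0;
    // Each iteration handles 8 floats. Both loads are issued before either
    // store, so no store immediately consumes the result of the load
    // directly preceding it.
    for (; i + 8 <= n; i += 8) {
        __m128 a = _mm_loadu_ps(src + i);
        __m128 b = _mm_loadu_ps(src + i + 4);
        _mm_storeu_ps(dst + i, a);
        _mm_storeu_ps(dst + i + 4, b);
    }
    // Scalar tail for the remaining 0..7 elements.
    for (; i < n; ++i) {
        dst[i] = src[i];
    }
}
```

Whether this is actually faster than the straightforward loop or `std::copy` depends on the compiler and CPU, so it is worth timing it and checking the generated assembly rather than assuming a speedup.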
