I\'m tried to improve performance of copy operation via SSE and AVX:
#include
const int sz = 1024;
float *mas = (float *)_mm_
Writing fast SSE is not as simple as using SSE operations in place of their non-parallel equivalents. In this case I suspect your compiler cannot usefully unroll the load/store pair and your time is dominated by stalls caused by using the output of one low-throughput operation (the load) in the very next instruction (the store).
You can test this idea by manually unrolling one notch:
//SSE-copy testing
start2 = std::chrono::system_clock::now();
for(int i=0; i
Normally when using intrinsics I disassemble the output and make sure nothing crazy is going on (you could try this to verify if/how the original loop got unrolled). For more complex loops the right tool to use is the Intel Architecture Code Analyzer (IACA). It's a static analysis tool which can tell you things like "you have pipeline stalls".