I tried to improve the performance of a copy operation via SSE and AVX:
#include
const int sz = 1024;
float *mas = (float *)_mm_
I think that your main problem/bottleneck is your _mm_malloc.
I highly suggest using std::vector as your main data structure if you are concerned about locality in C++.
Intrinsics are not exactly a "library"; they are more like builtin functions provided by your compiler, so you should be familiar with your compiler's internals and docs before using these functions.
Also note that AVX being newer than SSE doesn't make AVX faster. Whatever you are planning to use, the number of cycles taken by a function is probably more important than the "AVX vs SSE" argument; for example, see this answer.
Try with a POD int array[] or a std::vector.
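As a baseline to compare against, the suggestion could look roughly like this. The helper names are hypothetical, and `sz` simply reuses the size from the question; this is a sketch, not a claim about what is fastest.

```cpp
#include <algorithm>
#include <vector>

const int sz = 1024;  // same size as in the question

// Hypothetical helper: fills a std::vector<float> and copies it with std::copy.
std::vector<float> copy_with_vector() {
    std::vector<float> src(sz);
    for (int i = 0; i < sz; ++i) {
        src[i] = static_cast<float>(i);
    }
    std::vector<float> dst(src.size());
    std::copy(src.begin(), src.end(), dst.begin());
    return dst;
}

// Hypothetical helper: the same copy between plain int arrays.
// Returns the last copied element so the result can be checked.
int copy_with_array() {
    int src[sz];
    int dst[sz];
    for (int i = 0; i < sz; ++i) {
        src[i] = i;
    }
    std::copy(src, src + sz, dst);
    return dst[sz - 1];
}
```

For trivially copyable element types, standard library implementations commonly lower `std::copy` to a `memmove`-style routine, which is often already vectorized, so it makes a fair reference point before hand-writing intrinsics.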