Why does _mm_stream_ps produce L1/LL cache misses?

こ雲淡風輕ζ submitted on 2019-12-02 21:15:47
  1. Your benchmark probably measures memory allocation as much as write performance. The OS may not map physical pages when malloc is called but on first touch, which happens inside your func* functions. The OS may also move memory around after a large allocation, so benchmarks run immediately after allocating memory may be unreliable.
  2. Your code has an aliasing problem: the compiler cannot prove that the array pointer does not change while the array is being filled, so it has to reload arr from memory every time instead of keeping it in a register. This may cost some performance. The easiest way to avoid aliasing is to copy arr and length into local variables and use only those locals to fill the array. Aliasing is one of the reasons behind the well-known advice to avoid global variables.
  3. _mm_stream_ps works better if the array is aligned to 64 bytes. Your code guarantees no such alignment (malloc typically aligns to only 16 bytes). This optimization is noticeable only for short arrays.
  4. It is a good idea to call _mm_mfence after you have finished with _mm_stream_ps. This is needed for correctness, not for performance.
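
The four points above can be combined into one small sketch. The helper names alloc_prefaulted and fill_stream are hypothetical and do not come from the original question. The sketch assumes a C11 toolchain that provides aligned_alloc and an x86 target with SSE. It also uses _mm_sfence instead of _mm_mfence: the store fence is enough to order the non-temporal stores before any later stores, while _mm_mfence is the stronger choice named in point 4.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <xmmintrin.h>  /* _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: allocate n floats on a 64-byte boundary and touch
   every page, so that page faults happen before the timed region. */
float *alloc_prefaulted(size_t n)
{
    /* aligned_alloc (C11) wants the size to be a multiple of the alignment. */
    size_t bytes = (n * sizeof(float) + 63) / 64 * 64;
    float *p = aligned_alloc(64, bytes);
    if (p != NULL)
        memset(p, 0, bytes);  /* first touch of every page */
    return p;
}

/* Hypothetical fill routine. dst and n are parameters (locals), so the
   compiler can keep them in registers instead of reloading globals.
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
void fill_stream(float *dst, size_t n, float value)
{
    assert((uintptr_t)dst % 16 == 0 && n % 4 == 0);
    const __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);
    _mm_sfence();  /* order the non-temporal stores before later stores */
}
```

A benchmark would call alloc_prefaulted once, outside the timed region, and then time only fill_stream. Release the buffer with free when done.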

Shouldn't func4 be this:

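void func4() {
    /* All four lanes hold 5.0f. */
    __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
    /* Each iteration writes 16 floats (64 bytes) with four non-temporal
       stores; assumes length is a multiple of 16 and arr is 16-byte aligned. */
    for(int i = 0; i < length; i += 16) {
        _mm_stream_ps(&arr[i], buf);
        _mm_stream_ps(&arr[i+4], buf);
        _mm_stream_ps(&arr[i+8], buf);
        _mm_stream_ps(&arr[i+12], buf);
    }
}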
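
The unrolled loop in func4 only covers the whole array when length is a multiple of 16. A possible variant that also handles a leftover tail is sketched here. The name fill_stream_unrolled and its parameters are hypothetical and not part of the original code, and the sketch assumes that dst is 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical variant of func4 that takes the buffer as parameters.
   Assumes dst is 16-byte aligned; n may be any length. */
void fill_stream_unrolled(float *dst, size_t n, float value)
{
    assert((uintptr_t)dst % 16 == 0);
    const __m128 v = _mm_set1_ps(value);
    size_t i = 0;
    /* Main loop: 16 floats (64 bytes) per iteration, which is one full
       cache line when dst is also 64-byte aligned. */
    for (; i + 16 <= n; i += 16) {
        _mm_stream_ps(dst + i, v);
        _mm_stream_ps(dst + i + 4, v);
        _mm_stream_ps(dst + i + 8, v);
        _mm_stream_ps(dst + i + 12, v);
    }
    /* Tail: ordinary scalar stores for the remaining 0 to 15 elements. */
    for (; i < n; i++)
        dst[i] = value;
    _mm_sfence();  /* order the non-temporal stores before later stores */
}
```

Using plain stores for the tail keeps the streaming stores to full 16-byte groups, at the cost of the last few elements going through the cache.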