How to properly use prefetch instructions?
Question

I am trying to vectorize a loop that computes the dot product of two large float vectors. I compute it in parallel, taking advantage of the fact that the CPU has a large number of XMM registers, like this:

    __m128 *A, *B;
    __m128 dot0, dot1, dot2, dot3;
    dot0 = dot1 = dot2 = dot3 = _mm_set_ps1(0);  /* four independent accumulators */
    for (size_t i = 0; i < 1048576; i += 4) {
        dot0 = _mm_add_ps(dot0, _mm_mul_ps(A[i+0], B[i+0]));
        dot1 = _mm_add_ps(dot1, _mm_mul_ps(A[i+1], B[i+1]));
        dot2 = _mm_add_ps(dot2, _mm_mul_ps(A[i+2], B[i+2]));
        dot3 = _mm_add_ps(dot3, _mm_mul_ps(A[i+3], B[i+3]));