How do I perform an 8 x 8 matrix operation using SSE?
My initial attempt looked like this (suppose we want to multiply the matrix by a vector):

    __m128 mat[n];              /* rows */
    __m128 vec[n] = {1,1,1,1};
    float outvector[n];
    for (int row = 0; row < n; row++) {
        for (int k = 3; k < 8; k = k + 4) {
            __m128 mrow = mat[k];
            __m128 v = vec[row];
            __m128 sum = _mm_mul_ps(mrow, v);
            sum = _mm_hadd_ps(sum, sum); /* adds adjacent pairs of floats */
        }
        _mm_store_ss(&outvector[row], _mm_hadd_ps(sum, sum));
    }

But this clearly doesn't work. How should I approach this? I assume I should load 4 floats at a time.

My other question: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?
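For reference, here is a rough sketch of the shape I think the "load 4 at a time" idea would take, not a definitive implementation. It assumes the matrix is stored row-major as 64 plain floats (rather than an array of `__m128`), so each row is two 4-float loads. The names `hsum4`, `matvec8x8`, and `alloc_floats16` are made up for illustration. To stay within SSE1, it replaces `_mm_hadd_ps` (which needs SSE3) with a shuffle-and-add horizontal sum. For the alignment question it assumes the compiler's intrinsics header provides `_mm_malloc`/`_mm_free`, which GCC and Clang do; for static arrays, C11 `_Alignas(16)` should serve the same purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE1 intrinsics; assumed to also declare _mm_malloc/_mm_free */

/* Hypothetical helper: sum the four lanes of v using SSE1 only. */
static float hsum4(__m128 v)
{
    /* shuf = (v1, v0, v3, v2) */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* sums = (v0+v1, v0+v1, v2+v3, v2+v3) */
    __m128 sums = _mm_add_ps(v, shuf);
    /* move the upper pair sum (v2+v3) down into lane 0 */
    shuf = _mm_movehl_ps(shuf, sums);
    /* lane 0 now holds v0+v1+v2+v3 */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Hypothetical helper: out = mat * vec for a row-major 8x8 matrix.
   mat (64 floats) and vec (8 floats) must be 16-byte aligned. */
static void matvec8x8(const float *mat, const float *vec, float *out)
{
    assert(((uintptr_t)mat & 15) == 0);
    assert(((uintptr_t)vec & 15) == 0);

    /* The vector is reused for every row, so load its two halves once. */
    __m128 v_lo = _mm_load_ps(vec);
    __m128 v_hi = _mm_load_ps(vec + 4);

    for (int row = 0; row < 8; row++) {
        const float *r = mat + 8 * row;  /* 32-byte stride keeps rows aligned */
        __m128 p_lo = _mm_mul_ps(_mm_load_ps(r), v_lo);
        __m128 p_hi = _mm_mul_ps(_mm_load_ps(r + 4), v_hi);
        /* add the two partial products, then reduce 4 lanes to 1 */
        out[row] = hsum4(_mm_add_ps(p_lo, p_hi));
    }
}

/* Hypothetical helper: allocate n floats on a 16-byte boundary.
   Release the result with _mm_free, not free. */
static float *alloc_floats16(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);
}
```

With this layout, a large array (say n = 1000 floats) would be allocated as `alloc_floats16(1000)` and released with `_mm_free`. If alignment cannot be guaranteed, `_mm_loadu_ps` can be used in place of `_mm_load_ps`, though it may be slower on older CPUs.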