I want to multiply two matrices, first using the straightforward algorithm:
template<typename T>
void multiplicate_straight(T ** A, T ** B, T *
I believe this should do the same thing as the first loop with SSE, assuming T is double, sizeX is a multiple of two, and every row is 16-byte aligned.
You may gain a bit more performance by unrolling the loop and using several independent temp variables that you add together at the end. You could also try AVX and the new Fused Multiply Add instruction.
#include <emmintrin.h> // SSE2 intrinsics

template<typename T> // T must be double: the _pd intrinsics work on double
void multiplicate_SSE2(T ** A, T ** B, T ** C, int sizeX)
{
    // Transpose B so that the inner loop reads both operands row-wise
    T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    transpose_matrix(B, D, sizeX);
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            __m128d temp = _mm_setzero_pd();
            // Two doubles per iteration; rows must be 16-byte aligned
            for(int g = 0; g < sizeX; g += 2)
            {
                __m128d a = _mm_load_pd(&A[i][g]);
                __m128d b = _mm_load_pd(&D[j][g]);
                temp = _mm_add_pd(temp, _mm_mul_pd(a, b));
            }
            // Add top and bottom half of temp together
            temp = _mm_add_pd(temp, _mm_shuffle_pd(temp, temp, 1));
            _mm_store_sd(&C[i][j], temp); // Store one value (pointer first, then vector)
        }
    }
    FreeDynamicArray2D(D);
}
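
To illustrate the unrolling suggestion, here is a rough sketch of the inner dot product with two independent accumulators. The function name `dot_sse2_unrolled` is made up for this example and is not part of the code above. It is written under different assumptions: it uses unaligned loads (`_mm_loadu_pd`) and a scalar tail loop, so it does not require 16-byte alignment or a length that is a multiple of two. Whether it is actually faster depends on your compiler and CPU, so measure it.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>

// Hypothetical helper: dot product of two double arrays of length n.
double dot_sse2_unrolled(const double* a, const double* b, std::size_t n)
{
    // Two accumulators so consecutive additions do not depend on each other
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    std::size_t g = 0;

    // Main loop: four doubles per iteration
    for (; g + 4 <= n; g += 4)
    {
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(a + g),
                                           _mm_loadu_pd(b + g)));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(a + g + 2),
                                           _mm_loadu_pd(b + g + 2)));
    }

    // Combine the accumulators, then add the top and bottom halves
    __m128d acc = _mm_add_pd(acc0, acc1);
    acc = _mm_add_pd(acc, _mm_shuffle_pd(acc, acc, 1));
    double sum = _mm_cvtsd_f64(acc);

    // Scalar tail for the remaining 0 to 3 elements
    for (; g < n; ++g)
        sum += a[g] * b[g];

    return sum;
}
```

In the routine above, the body of the `g` loop and the final reduction could then be replaced by something like `C[i][j] = dot_sse2_unrolled(A[i], D[j], sizeX);`. Note that summing in a different order can change the last bits of the result compared to the scalar loop.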