Calculating a matrix product is much slower with SSE than with the straightforward algorithm

滥情空心 2021-01-03 05:40

I want to multiply two matrices, first using the straightforward algorithm:

template <typename T>
void multiplicate_straight(T ** A, T ** B, T *

2 answers
  •  夕颜 (original poster)
    2021-01-03 06:16

    I believe the code below does the same thing as the first loop, using SSE, assuming sizeX is a multiple of two and the memory is 16-byte aligned (the aligned loads require it).

    You may gain a bit more performance by unrolling the inner loop and using several independent temp variables, which you add together at the end. You could also try AVX and the newer fused multiply-add (FMA) instructions.

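    #include <emmintrin.h> // SSE2 intrinsics (__m128d, _mm_*_pd)

    // The __m128d intrinsics operate on doubles, so T must be double here.
    template <typename T>
    void multiplicate_SSE2(T ** A, T ** B, T ** C, int sizeX)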
    {
        T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX); // scratch matrix for the transpose of B
        transpose_matrix(B, D,sizeX);
        for(int i = 0; i < sizeX; i++)
        {
            for(int j = 0; j < sizeX; j++)
            {
                __m128d temp = _mm_setzero_pd();
                for(int g = 0; g < sizeX; g += 2)
                {
                    __m128d a = _mm_load_pd(&A[i][g]);
                    __m128d b = _mm_load_pd(&D[j][g]);
                    temp = _mm_add_pd(temp, _mm_mul_pd(a,b));
                }
                // Add top and bottom half of temp together
                temp = _mm_add_pd(temp, _mm_shuffle_pd(temp, temp, 1));
                _mm_store_sd(&C[i][j], temp); // Store the low element (destination pointer comes first)
            }
        }
        FreeDynamicArray2D(D);
    }
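
    The unrolling suggestion above could look roughly like the sketch below. It is only an illustration under stated assumptions: dot_sse2_unrolled is a hypothetical helper name (not from the original code), T is taken to be double, n must be a multiple of four, and both rows must be 16-byte aligned. Two accumulators are used so that the two add chains do not depend on each other.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper (not part of the original post): dot product of two
// rows of doubles using two independent SSE2 accumulators.
// Assumptions: n is a multiple of 4, a and b are 16-byte aligned.
double dot_sse2_unrolled(const double* a, const double* b, std::size_t n)
{
    assert(n % 4 == 0);
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    for (std::size_t g = 0; g < n; g += 4)
    {
        // Two independent multiply-add chains per iteration
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_load_pd(a + g),
                                           _mm_load_pd(b + g)));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_load_pd(a + g + 2),
                                           _mm_load_pd(b + g + 2)));
    }
    // Combine the two accumulators, then add the upper and lower halves
    __m128d sum = _mm_add_pd(acc0, acc1);
    sum = _mm_add_pd(sum, _mm_shuffle_pd(sum, sum, 1));
    return _mm_cvtsd_f64(sum); // extract the low element
}
```

    With such a helper, the innermost g loop and the final store in multiplicate_SSE2 would reduce to C[i][j] = dot_sse2_unrolled(A[i], D[j], sizeX);, provided sizeX is a multiple of four. Whether this is actually faster depends on the CPU and compiler, so it is worth measuring.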
    
