SSE, row major vs column major performance issue

◇◆丶佛笑我妖孽 submitted on 2019-12-03 08:50:49
Z boson

Don't use _mm_dp_ps for matrix multiplication! I already explained this in great detail in "Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?" (incidentally, this was my first post on SO).

You don't need anything more than SSE to do this efficiently (not even SSE2). Use this code to do 4x4 matrix multiplication efficiently. If the matrices are stored in row-major order, call gemm4x4_SSE(A,B,C). If the matrices are stored in column-major order, call gemm4x4_SSE(B,A,C): a column-major buffer read as row-major is the transpose, and (A*B)^T = B^T*A^T, so swapping the arguments gives the product in column-major order.

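#include <xmmintrin.h>  // SSE intrinsics

// C = A*B for 4x4 float matrices stored row-major.
// _mm_load_ps/_mm_store_ps require B and C to be 16-byte aligned.
void gemm4x4_SSE(const float *A, const float *B, float *C) {
    __m128 row[4], sum[4];
    // Load the four rows of B once.
    for(int i=0; i<4; i++)  row[i] = _mm_load_ps(&B[i*4]);
    for(int i=0; i<4; i++) {
        sum[i] = _mm_setzero_ps();
        for(int j=0; j<4; j++) {
            // Row i of C accumulates A[i][j] * (row j of B).
            sum[i] = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(A[i*4+j]), row[j]), sum[i]);
        }
    }
    for(int i=0; i<4; i++) _mm_store_ps(&C[i*4], sum[i]);
}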
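
To see the argument swap in action, here is a minimal self-check sketch, not part of the original answer. It repeats gemm4x4_SSE so it compiles on its own. The names gemm4x4_ref and layouts_agree are hypothetical helpers introduced only for this illustration, and _Alignas assumes a C11 compiler. It multiplies the same two matrices once from row-major buffers and once from column-major buffers with the arguments swapped, then compares both results with a scalar reference.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Same routine as above, repeated so this snippet compiles on its own. */
static void gemm4x4_SSE(const float *A, const float *B, float *C) {
    __m128 row[4], sum[4];
    for (int i = 0; i < 4; i++) row[i] = _mm_load_ps(&B[i*4]);
    for (int i = 0; i < 4; i++) {
        sum[i] = _mm_setzero_ps();
        for (int j = 0; j < 4; j++)
            sum[i] = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(A[i*4+j]), row[j]), sum[i]);
    }
    for (int i = 0; i < 4; i++) _mm_store_ps(&C[i*4], sum[i]);
}

/* Hypothetical scalar reference: row-major C = A * B. */
static void gemm4x4_ref(const float *A, const float *B, float *C) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++) s += A[i*4+k] * B[k*4+j];
            C[i*4+j] = s;
        }
}

/* Hypothetical check: returns 1 if both layouts match the reference. */
int layouts_agree(void) {
    _Alignas(16) float A[16];   /* row-major inputs */
    _Alignas(16) float B[16];
    _Alignas(16) float At[16];  /* the same matrices, column-major */
    _Alignas(16) float Bt[16];
    _Alignas(16) float C[16];   /* row-major result */
    _Alignas(16) float Ct[16];  /* column-major result */
    float R[16];                /* scalar reference, row-major */

    /* Small integers keep every product and sum exact in float. */
    for (int i = 0; i < 16; i++) {
        A[i] = (float)(i + 1);
        B[i] = (float)(16 - i);
    }
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++) {
            At[c*4+r] = A[r*4+c];
            Bt[c*4+r] = B[r*4+c];
        }

    gemm4x4_ref(A, B, R);
    gemm4x4_SSE(A, B, C);     /* row-major: arguments in natural order */
    gemm4x4_SSE(Bt, At, Ct);  /* column-major: arguments swapped */

    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++) {
            if (C[r*4+c] != R[r*4+c]) return 0;
            if (Ct[c*4+r] != R[r*4+c]) return 0;  /* element (r,c) in column-major */
        }
    return 1;
}
```

Calling layouts_agree() from a test harness should return 1. The same kernel serves both layouts, so the choice of storage order costs nothing here.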

We actually profiled 3x4 matrix pseudo-multiplication (treating each 3x4 matrix as if it were a 4x4 affine matrix) and found that, with both SSE3 and AVX, there was very little difference (<10%) between the column-major and row-major layouts, as long as both are optimized to the limit.

The benchmark: https://github.com/buildaworldnet/IrrlichtBAW/blob/master/examples_tests/19.SIMDmatrixMultiplication/main.cpp
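
For readers unfamiliar with the term, here is a rough sketch of what a row-major 3x4 "pseudo-multiplication" could look like with plain SSE. This is not the benchmarked code from the linked repository. The names affine3x4_mul_SSE and affine_matches_4x4 are hypothetical, and the sketch assumes that each 3x4 matrix stands for a 4x4 matrix whose implicit last row is (0, 0, 0, 1).

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: C = A * B for row-major 3x4 matrices (12 floats each,
   16-byte aligned), each treated as a 4x4 matrix with an implicit last row
   (0, 0, 0, 1). C must not alias A. */
void affine3x4_mul_SSE(const float *A, const float *B, float *C) {
    __m128 row[3];
    for (int i = 0; i < 3; i++) row[i] = _mm_load_ps(&B[i*4]);
    for (int i = 0; i < 3; i++) {
        /* A[i][3] times the implicit row (0, 0, 0, 1) of B.
           _mm_set_ps takes its arguments from the highest lane down. */
        __m128 sum = _mm_set_ps(A[i*4+3], 0.0f, 0.0f, 0.0f);
        for (int j = 0; j < 3; j++)
            sum = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(A[i*4+j]), row[j]), sum);
        _mm_store_ps(&C[i*4], sum);
    }
}

/* Hypothetical check against a full scalar 4x4 product of the padded matrices. */
int affine_matches_4x4(void) {
    _Alignas(16) float A[12];
    _Alignas(16) float B[12];
    _Alignas(16) float C[12];
    float A4[16], B4[16];

    /* Small integers keep the arithmetic exact in float. */
    for (int i = 0; i < 12; i++) {
        A[i] = (float)(i + 1);
        B[i] = (float)(12 - i);
        A4[i] = A[i];
        B4[i] = B[i];
    }
    /* Pad with the implicit last row (0, 0, 0, 1). */
    for (int i = 12; i < 16; i++) {
        A4[i] = (i == 15) ? 1.0f : 0.0f;
        B4[i] = (i == 15) ? 1.0f : 0.0f;
    }

    affine3x4_mul_SSE(A, B, C);

    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 4; c++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++) s += A4[r*4+k] * B4[k*4+c];
            if (C[r*4+c] != s) return 0;
        }
    return 1;
}
```

The only difference from the full 4x4 kernel is that the fourth row of B is never loaded: its contribution reduces to adding A[i][3] to the last lane of each output row.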
