Fast dot product using SSE/AVX intrinsics

逝去的感伤 2020-12-17 23:31

I am looking for a fast way to calculate the dot product of vectors with 3 or 4 components. I tried several things, but most examples online use an array of floats while our

2 Answers
  •  余生分开走
    2020-12-18 00:06

    Algebraically, efficient SIMD looks almost identical to scalar code. So the right way to do the dot product is to operate on four vectors at once with SSE (eight with AVX), with each SIMD register holding the same component of four different vectors.

    Consider constructing your code like this:

    #include <immintrin.h>  // SSE/AVX intrinsics (__m128, _mm_*_ps)
    
    struct float4 {
        __m128 xmm;
        float4 () {};
        float4 (__m128 const & x) { xmm = x; }
        float4 & operator = (__m128 const & x) { xmm = x; return *this; }
        float4 & load(float const * p) { xmm = _mm_loadu_ps(p); return *this; }
        operator __m128() const { return xmm; }
    };
    
    static inline float4 operator + (float4 const & a, float4 const & b) {
        return _mm_add_ps(a, b);
    }
    static inline float4 operator * (float4 const & a, float4 const & b) {
        return _mm_mul_ps(a, b);
    }
    
    struct block3 {
        float4 x, y, z;
    };
    
    struct block4 {
        float4 x, y, z, w;
    };
    
    static inline float4 dot(block3 const & a, block3 const & b) {
        return a.x*b.x + a.y*b.y + a.z*b.z;
    }
    
    static inline float4 dot(block4 const & a, block4 const & b) {
        return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
    }
    

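    Notice that the last two functions look almost identical to your scalar dot function, except that float becomes float4 and float4 becomes block3 or block4. Each call yields four dot products using only vertical multiplies and adds, with no horizontal shuffles or reductions, which is what makes this layout efficient.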
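
    To make the "four vectors at once" idea concrete, here is a minimal sketch of how the types above might be driven over whole arrays. It assumes your data can be stored as separate x, y and z arrays (structure-of-arrays) rather than as an array of xyz structs. The `store` member and the `dot3_batch` helper are hypothetical names added for illustration; they are not part of the answer's code, and the float4/block3 definitions are repeated only so the block compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Trimmed restatement of the answer's float4 so this block is self-contained.
struct float4 {
    __m128 xmm;
    float4() : xmm(_mm_setzero_ps()) {}
    float4(__m128 const& x) : xmm(x) {}
    float4& load(float const* p) { xmm = _mm_loadu_ps(p); return *this; }
    // Assumption: an unaligned store helper, added here for illustration.
    void store(float* p) const { _mm_storeu_ps(p, xmm); }
    operator __m128() const { return xmm; }
};

static inline float4 operator+(float4 const& a, float4 const& b) {
    return _mm_add_ps(a, b);
}
static inline float4 operator*(float4 const& a, float4 const& b) {
    return _mm_mul_ps(a, b);
}

// Each member holds the same component of four different vectors.
struct block3 {
    float4 x, y, z;
};

static inline float4 dot(block3 const& a, block3 const& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical helper: computes out[i] = a[i] . b[i] for n 3-component
// vectors stored as separate component arrays. n must be a multiple of 4;
// a real implementation would also handle the leftover elements.
static void dot3_batch(float const* ax, float const* ay, float const* az,
                       float const* bx, float const* by, float const* bz,
                       float* out, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        block3 a, b;
        a.x.load(ax + i); a.y.load(ay + i); a.z.load(az + i);
        b.x.load(bx + i); b.y.load(by + i); b.z.load(bz + i);
        dot(a, b).store(out + i);  // four dot products per iteration
    }
}
```

    Each loop iteration performs six loads, three multiplies, two adds and one store, and produces four results. If your vectors currently live in an array of structs, you would need to transpose them into this layout first, and whether that pays off depends on how often the data is reused.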
