How to Calculate single-vector Dot Product using SSE intrinsic functions in C

Backend · open · 4 answers · 991 views
攒了一身酷  2020-12-08 08:12

I am trying to multiply two vectors together, where each element of one vector is multiplied by the element at the same index in the other vector. I then want to sum all the elements of the resulting vector to get a single value, the dot product.

4 Answers
  •  孤街浪徒
    2020-12-08 09:01

    If you're doing a dot-product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop. Save the horizontal sum until the end.
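
    A minimal sketch of that loop structure, assuming n is a multiple of 4 and plain float inputs (the names dot_sse and hsum_ps are mine, not from the question):

    #include <stddef.h>      /* size_t */
    #include <xmmintrin.h>   /* SSE1 intrinsics */

    /* hsum_ps is the shuffle/add horizontal sum shown further down. */
    static float hsum_ps(__m128 v)
    {
        __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
        __m128 sums = _mm_add_ps(v, shuf);
        shuf        = _mm_movehl_ps(shuf, sums);
        sums        = _mm_add_ss(sums, shuf);
        return _mm_cvtss_f32(sums);
    }

    float dot_sse(const float *a, const float *b, size_t n)
    {
        __m128 acc = _mm_setzero_ps();                  /* 4 running partial sums */
        for (size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);            /* unaligned loads for safety */
            __m128 vb = _mm_loadu_ps(b + i);
            acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));  /* multiply, then vertical add */
        }
        return hsum_ps(acc);                            /* one horizontal sum at the very end */
    }

    For long vectors, splitting the work over two or more independent accumulators helps hide FP add (or FMA) latency.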


    But if you are doing a dot product of just a single pair of SIMD vectors:

    GCC (at least version 4.3) provides <smmintrin.h> with SSE4.1-level intrinsics, including the single- and double-precision dot products:

    __m128  _mm_dp_ps (__m128  __X, __m128  __Y, const int __M);
    __m128d _mm_dp_pd (__m128d __X, __m128d __Y, const int __M);
    

    On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.

    But on AMD (including Ryzen), dpps is significantly slower. (See Agner Fog's instruction tables)
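
    If you do target CPUs where dpps is a win, the immediate operand controls the operation: the high nibble selects which elements are multiplied and summed, the low nibble selects which output lanes receive the sum. A minimal sketch (the name dot4_dpps is mine):

    #include <smmintrin.h>   /* SSE4.1 intrinsics; compile with -msse4.1 */

    /* 0xFF: multiply all four element pairs and broadcast the sum to every lane. */
    float dot4_dpps(__m128 a, __m128 b)
    {
        __m128 dp = _mm_dp_ps(a, b, 0xFF);
        return _mm_cvtss_f32(dp);   /* the sum is in the low element */
    }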


    As a fallback for older processors, you can use this algorithm to compute the dot product of the vectors a and b:

    __m128 r1 = _mm_mul_ps(a, b);
    

    and then do a horizontal sum of r1 using the technique from "Fastest way to do horizontal float vector sum on x86" (see that Q&A for a fully commented version of this sequence, and why it's faster than the alternatives):

    __m128 shuf  = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1)); // [r1, r0, r3, r2]
    __m128 sums  = _mm_add_ps(r1, shuf);                            // [r0+r1, r0+r1, r2+r3, r2+r3]
    shuf         = _mm_movehl_ps(shuf, sums);                       // high pair sum (r2+r3) now in element 0
    sums         = _mm_add_ss(sums, shuf);                          // (r0+r1) + (r2+r3) in element 0
    float result = _mm_cvtss_f32(sums);                             // extract the scalar result
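
    Putting the pieces together, a complete compilable sketch might look like this (the helper name dot4_sse1 and the test values are mine; it simply repeats the multiply and horizontal-sum sequence above):

    #include <stdio.h>
    #include <xmmintrin.h>

    /* Dot product of one pair of __m128 vectors: mulps + shuffle/add horizontal sum. */
    static float dot4_sse1(__m128 a, __m128 b)
    {
        __m128 r1   = _mm_mul_ps(a, b);
        __m128 shuf = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
        __m128 sums = _mm_add_ps(r1, shuf);
        shuf        = _mm_movehl_ps(shuf, sums);
        sums        = _mm_add_ss(sums, shuf);
        return _mm_cvtss_f32(sums);
    }

    int main(void)
    {
        float x[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        float y[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
        /* 1*5 + 2*6 + 3*7 + 4*8 = 70 */
        printf("%f\n", dot4_sse1(_mm_loadu_ps(x), _mm_loadu_ps(y)));
        return 0;
    }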
    

    A slower alternative, using SSE3 _mm_hadd_ps, costs 2 shuffle uops per hadd, which will easily bottleneck on shuffle throughput, especially on Intel CPUs:

    __m128 r2 = _mm_hadd_ps(r1, r1);   // [r0+r1, r2+r3, r0+r1, r2+r3]
    __m128 r3 = _mm_hadd_ps(r2, r2);   // full sum broadcast to every element
    _mm_store_ss(&result, r3);         // store the low element into a float
    
