I would suggest using 128-bit AVX instructions whenever possible. It avoids the latency of one cross-lane shuffle (2 cycles on Intel Sandy/Ivy Bridge) and improves efficiency on CPUs that execute AVX instructions on 128-bit execution units (currently AMD Bulldozer, Piledriver, Steamroller, and Jaguar):
static inline float _mm256_reduce_add_ps(__m256 x) {
    /* ( x3+x7, x2+x6, x1+x5, x0+x4 ) */
    const __m128 x128 = _mm_add_ps(_mm256_extractf128_ps(x, 1), _mm256_castps256_ps128(x));
    /* ( -, -, x1+x3+x5+x7, x0+x2+x4+x6 ) */
    const __m128 x64 = _mm_add_ps(x128, _mm_movehl_ps(x128, x128));
    /* ( -, -, -, x0+x1+x2+x3+x4+x5+x6+x7 ) */
    const __m128 x32 = _mm_add_ss(x64, _mm_shuffle_ps(x64, x64, 0x55));
    /* Conversion to float is a no-op on x86-64 */
    return _mm_cvtss_f32(x32);
}