Horizontal sum of 32-bit floats in 256-bit AVX vector [duplicate]

白昼怎懂夜的黑 提交于 2019-11-28 14:25:15

I would suggest to use 128-bit AVX instructions whenever possible. It will reduce the latency of one cross-domain shuffle (2 cycles on Intel Sandy/Ivy Bridge) and improve efficiency on CPUs which run AVX instructions on 128-bit execution units (currently AMD Bulldozer, Piledriver, Steamroller, and Jaguar):

static inline float _mm256_reduce_add_ps(__m256 x) {
    /* ( x3+x7, x2+x6, x1+x5, x0+x4 ) */
    const __m128 x128 = _mm_add_ps(_mm256_extractf128_ps(x, 1), _mm256_castps256_ps128(x));
    /* ( -, -, x1+x3+x5+x7, x0+x2+x4+x6 ) */
    const __m128 x64 = _mm_add_ps(x128, _mm_movehl_ps(x128, x128));
    /* ( -, -, -, x0+x1+x2+x3+x4+x5+x6+x7 ) */
    const __m128 x32 = _mm_add_ss(x64, _mm_shuffle_ps(x64, x64, 0x55));
    /* Conversion to float is a no-op on x86-64 */
    return _mm_cvtss_f32(x32);
}

You can emulate a full horizontal add with AVX (i.e. a proper 256 bit version of _mm256_hadd_ps) like this:

#define _mm256_full_hadd_ps(v0, v1) \
        _mm256_hadd_ps(_mm256_permute2f128_ps(v0, v1, 0x20), \
                       _mm256_permute2f128_ps(v0, v1, 0x31))

If you're just working with one input vector then you may be able to simplify this a little.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!