Fastest way to horizontally sum SSE unsigned byte vector

牧云@^-^@ 提交于 2020-12-23 02:33:19

问题


I need to horizontally add a __m128i that is 16 x epi8 values. The XOP instructions would make this trivial, but I don't have those available.

Current method is:

hd = _mm_hadd_epi16(_mm_cvtepi8_epi16(sum), _mm_cvtepi8_epi16(_mm_shuffle_epi8(sum, swap)));
hd = _mm_hadd_epi16(hd, hd);
hd = _mm_hadd_epi16(hd, hd);

Is there a better way with up to SSE4.1?


回答1:


You can do it with SSE2's _mm_sad_epu8 (psadbw), e.g.:

inline uint32_t _mm_sum_epu8(const __m128i v)
{
    __m128i vsum = _mm_sad_epu8(v, _mm_setzero_si128());
    return _mm_extract_epi16(vsum, 0) + _mm_extract_epi16(vsum, 4);
}


来源:https://stackoverflow.com/questions/36998538/fastest-way-to-horizontally-sum-sse-unsigned-byte-vector

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!