Fastest way to horizontally sum SSE unsigned byte vector
Question

I need to horizontally add a __m128i that holds 16 x epi8 values. The XOP instructions would make this trivial, but I don't have those available. My current method is:

    hd = _mm_hadd_epi16(_mm_cvtepi8_epi16(sum), _mm_cvtepi8_epi16(_mm_shuffle_epi8(sum, swap)));
    hd = _mm_hadd_epi16(hd, hd);
    hd = _mm_hadd_epi16(hd, hd);

Is there a better way with up to SSE4.1?

Answer 1:

You can do it with SSE2's _mm_sad_epu8 (psadbw), e.g.:

    inline uint32_t _mm_sum_epu8(const __m128i v)
    {
        __m128i vsum = _mm_sad_epu8(v, _mm
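
For illustration, here is a minimal, self-contained sketch of the psadbw technique this answer names. It is not a reconstruction of the truncated snippet above: the function name hsum_epu8_sketch is hypothetical, and the way the two partial sums are combined (unpack high half, add, extract) is just one plausible choice. It assumes the bytes are unsigned and that the target supports SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: sums the 16 unsigned bytes of v.
   The largest possible result is 16 * 255 = 4080. */
static inline uint32_t hsum_epu8_sketch(__m128i v)
{
    /* psadbw against zero: |b - 0| summed per 8-byte half, giving two
       partial sums in the low 16 bits of each 64-bit lane. */
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());

    /* Copy the upper 64-bit lane down so both partial sums line up. */
    __m128i hi = _mm_unpackhi_epi64(sad, sad);

    /* Add the two partial sums and extract the low 32 bits. */
    __m128i total = _mm_add_epi32(sad, hi);
    return (uint32_t)_mm_cvtsi128_si32(total);
}
```

Calling hsum_epu8_sketch(_mm_set1_epi8(1)) returns 16. Because psadbw treats its inputs as unsigned, this sketch does not give the signed sum that the _mm_cvtepi8_epi16 version in the question computes when any byte has its top bit set.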