sse

Fastest way to horizontally sum SSE unsigned byte vector

牧云@^-^@ submitted on 2020-12-23 02:33:19
Question: I need to horizontally add a __m128i that holds 16 x epi8 values. The XOP instructions would make this trivial, but I don't have those available. My current method is:

    hd = _mm_hadd_epi16(_mm_cvtepi8_epi16(sum), _mm_cvtepi8_epi16(_mm_shuffle_epi8(sum, swap)));
    hd = _mm_hadd_epi16(hd, hd);
    hd = _mm_hadd_epi16(hd, hd);

Is there a better way with up to SSE4.1?

Answer 1: You can do it with SSE2's _mm_sad_epu8 (psadbw), e.g.:

    inline uint32_t _mm_sum_epu8(const __m128i v) { __m128i vsum = _mm_sad_epu8(v, _mm…
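The answer excerpt is cut off, so the following is only a minimal sketch of the psadbw idea it names, not the answer's own code. The helper name `hsum_epu8` is hypothetical. It assumes the bytes are meant to be treated as unsigned (as in the title), since `_mm_sad_epu8` computes sums of absolute differences of unsigned bytes; comparing against zero therefore yields the plain byte sums of each 8-byte half.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: sum of the 16 unsigned bytes in v (range 0..4080). */
static inline uint32_t hsum_epu8(__m128i v)
{
    /* psadbw against zero: each 64-bit half receives the sum of its
       8 unsigned bytes in its low 16 bits; the other bits are zero. */
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());

    /* Copy the upper 64-bit half into the low half, then add the two sums. */
    __m128i hi  = _mm_unpackhi_epi64(sad, sad);
    __m128i sum = _mm_add_epi32(sad, hi);

    /* The total sits in the low 32 bits. */
    return (uint32_t)_mm_cvtsi128_si32(sum);
}
```

For example, `hsum_epu8(_mm_set1_epi8(1))` returns 16. This needs only SSE2, so it fits within the "up to SSE4.1" constraint in the question.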

Does SSE/AVX provide a means of determining if a result was rounded up?

≡放荡痞女 submitted on 2020-12-09 12:20:55
Question: One of the purposes of the C1 bit in the x87 FPU status word is to show whether an inexact result was rounded up. Does SSE/AVX provide any such indication for scalar operations? I did not see a similar bit in the MXCSR register. Am I forced to use x87 instructions if I want this information?

Answer 1: SSE/AVX do not provide hardware support for detecting this, even for scalar instructions like addss. SSE was designed for SIMD, with 4 floats per XMM vector, and presumably Intel didn't want…
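To illustrate what MXCSR does report, here is a small sketch that checks the sticky precision (inexact) flag after a scalar `addss`. It shows that the register can tell you a result was rounded, but carries no bit for the rounding direction. The helper name `add_was_inexact` is hypothetical, and the `volatile` temporaries are only an assumption-driven guard against the compiler folding or reordering the addition.

```c
#include <xmmintrin.h>  /* SSE: _mm_getcsr, _mm_setcsr, _mm_add_ss */

/* Hypothetical helper: stores a + b in *result and returns 1 if the
   scalar SSE addition set the MXCSR precision (inexact) flag, else 0. */
static int add_was_inexact(float a, float b, float *result)
{
    volatile float va = a, vb = b;  /* volatile: keep the add at run time */
    volatile float vr;

    /* Clear the six sticky exception flags (bits 0..5 of MXCSR). */
    _mm_setcsr(_mm_getcsr() & ~(unsigned int)_MM_EXCEPT_MASK);

    /* Scalar single-precision add (addss). */
    vr = _mm_cvtss_f32(_mm_add_ss(_mm_set_ss(va), _mm_set_ss(vb)));

    /* Read the flags back; the PE bit says "inexact", not "rounded up". */
    unsigned int csr = _mm_getcsr();
    *result = vr;
    return (csr & _MM_EXCEPT_INEXACT) != 0;
}
```

With this sketch, `add_was_inexact(1.0f, 1.0f, &r)` reports an exact result, while `add_was_inexact(1.0f, 1e-10f, &r)` reports an inexact one. Neither call reveals which way the rounding went.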