Question
I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time.
I have been using the following implementation for the minimum for instance:
static inline int16_t hMin(__m128i buffer) {
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m1));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m2));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m3));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m4));
    return ((int8_t*) ((void *) &buffer))[0];
}
I need to compute the minimum and the maximum of 16 1-byte integers, as you see.
Any good suggestions are highly appreciated :)
Thanks
Answer 1:
I suggest two changes:
- Replace ((int8_t*) ((void *) &buffer))[0] with _mm_cvtsi128_si32.
- Replace _mm_shuffle_epi8 with _mm_shuffle_epi32/_mm_shufflelo_epi16, which have lower latency on recent AMD processors and Intel Atom, and will save you memory load operations (their shuffle control is an immediate operand, not a mask that has to be loaded):

static inline int16_t hMin(__m128i buffer) {
    /* fold the upper 8 bytes onto the lower 8 bytes */
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
    /* fold bytes 4..7 onto bytes 0..3 */
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
    /* fold bytes 2..3 onto bytes 0..1 */
    buffer = _mm_min_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
    /* fold byte 1 onto byte 0 */
    buffer = _mm_min_epi8(buffer, _mm_srli_epi16(buffer, 8));
    return (int8_t)_mm_cvtsi128_si32(buffer);
}
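The question also asks for the maximum. A minimal sketch of the counterpart, assuming the same fold pattern with _mm_max_epi8 swapped in for _mm_min_epi8. The name hMaxShuffle and the HMINMAX_SSE41 macro are hypothetical; the macro assumes GCC or Clang and only exists so the snippet builds without -msse4.1.

```c
#include <stdint.h>
#include <smmintrin.h> /* SSE4.1: _mm_max_epi8; also pulls in the SSE2 shuffles */

/* Assumption: GCC/Clang per-function target attribute; empty elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define HMINMAX_SSE41 __attribute__((target("sse4.1")))
#else
#define HMINMAX_SSE41
#endif

/* Hypothetical helper: horizontal maximum of 16 signed bytes. */
HMINMAX_SSE41
static inline int8_t hMaxShuffle(__m128i buffer) {
    /* fold the upper 8 bytes onto the lower 8 bytes */
    buffer = _mm_max_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
    /* fold bytes 4..7 onto bytes 0..3 */
    buffer = _mm_max_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
    /* fold bytes 2..3 onto bytes 0..1 */
    buffer = _mm_max_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
    /* fold byte 1 onto byte 0; the zero shifted into byte 1 is never read */
    buffer = _mm_max_epi8(buffer, _mm_srli_epi16(buffer, 8));
    return (int8_t)_mm_cvtsi128_si32(buffer);
}
```

Only byte 0 of the final register is meaningful, so the zeros that _mm_srli_epi16 shifts into the odd bytes do not affect the result even when every input is negative.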
Answer 2:
SSE 4.1 has an instruction that does almost what you want. Its name is PHMINPOSUW, C/C++ intrinsic is _mm_minpos_epu16. It is limited to 16-bit unsigned values and cannot give maximum, but these problems could be easily solved.
1. If you need to find minimum of non-negative bytes, do nothing. If bytes may be negative, add 128 to each. If you need maximum, subtract each from 127.
2. Use either _mm_srli_epi16 or _mm_shuffle_epi8, and then _mm_min_epu8 to get 8 pairwise minimum values in even bytes and zeros in odd bytes of some XMM register. (These zeros are produced by the shift/shuffle instruction and should remain in place after _mm_min_epu8.)
3. Use _mm_minpos_epu16 to find the minimum among these values.
4. Extract the resulting minimum value with _mm_cvtsi128_si32.
5. Undo the effect of step 1 to get the original byte value.
Here is an example that returns maximum of 16 signed bytes:
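static inline int16_t hMax(__m128i buffer)
{
    /* 127 - x maps the largest signed byte to the smallest unsigned byte */
    __m128i tmp1 = _mm_sub_epi8(_mm_set1_epi8(127), buffer);
    /* pairwise unsigned minimum in even bytes, zeros in odd bytes */
    __m128i tmp2 = _mm_min_epu8(tmp1, _mm_srli_epi16(tmp1, 8));
    /* minimum of the eight 16-bit lanes lands in the low word */
    __m128i tmp3 = _mm_minpos_epu16(tmp2);
    /* undo the mapping; the cast keeps only the low byte */
    return (int8_t)(127 - _mm_cvtsi128_si32(tmp3));
}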
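For the minimum of signed bytes, the same steps apply with the "add 128" bias from step 1 instead of the subtraction from 127. A minimal sketch, not a tuned implementation: the name hMinPos and the HMINMAX_SSE41 macro are hypothetical, and the macro assumes GCC or Clang so the snippet builds without -msse4.1.

```c
#include <stdint.h>
#include <smmintrin.h> /* SSE4.1: _mm_minpos_epu16 */

/* Assumption: GCC/Clang per-function target attribute; empty elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define HMINMAX_SSE41 __attribute__((target("sse4.1")))
#else
#define HMINMAX_SSE41
#endif

/* Hypothetical helper: horizontal minimum of 16 signed bytes. */
HMINMAX_SSE41
static inline int8_t hMinPos(__m128i buffer) {
    /* step 1: adding -128 is the same as adding 128 modulo 256, which
       moves signed bytes into unsigned order (0..255) */
    __m128i biased = _mm_add_epi8(buffer, _mm_set1_epi8(-128));
    /* step 2: pairwise unsigned minimum in even bytes, zeros in odd bytes */
    __m128i pairs = _mm_min_epu8(biased, _mm_srli_epi16(biased, 8));
    /* step 3: minimum of the eight 16-bit lanes goes to bits 0..15,
       its lane index to bits 16..18 */
    __m128i best = _mm_minpos_epu16(pairs);
    /* steps 4 and 5: mask off the lane index, then remove the bias */
    return (int8_t)((_mm_cvtsi128_si32(best) & 0xFF) - 128);
}
```

The mask with 0xFF is there because _mm_cvtsi128_si32 returns the low 32 bits, which also contain the lane index written by PHMINPOSUW.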
Source: https://stackoverflow.com/questions/22256525/horizontal-minimum-and-maximum-using-sse