I have just started using SSE and I am confused about how to get the maximum integer value (max) of a __m128i. For instance:
__m128i t =
According to this page, there is no horizontal max, and you need to test the elements vertically:
movhlps xmm1,xmm0 ; Move top two floats to lower part of xmm1
maxps xmm0,xmm1 ; Get maximum of the two sets of floats
pshufd xmm1,xmm0,$55 ; Move second float to lower part of xmm1
maxps xmm0,xmm1 ; Get maximum of the two remaining floats
Conversely, getting the minimum:
movhlps xmm1,xmm0
minps xmm0,xmm1
pshufd xmm1,xmm0,$55
minps xmm0,xmm1
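For anyone who prefers intrinsics to assembly, the same two-step reduction can be sketched roughly as below. This is only a sketch under a few assumptions: the names hmax_ps and hmin_ps are made up for illustration, and _mm_shuffle_ps (shufps) is swapped in for pshufd so that everything stays within plain SSE float instructions.

```c
#include <xmmintrin.h> /* SSE: _mm_movehl_ps, _mm_max_ps, _mm_min_ps, _mm_shuffle_ps */

/* Horizontal maximum of the four floats in v (hypothetical helper name). */
float hmax_ps(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);   /* hi = [v2, v3, v2, v3] */
    __m128 m  = _mm_max_ps(v, hi);     /* m0 = max(v0,v2), m1 = max(v1,v3) */
    __m128 m1 = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)); /* broadcast lane 1 */
    m = _mm_max_ps(m, m1);             /* lane 0 = max of all four */
    return _mm_cvtss_f32(m);           /* extract lane 0 */
}

/* Horizontal minimum, same shape with min instead of max. */
float hmin_ps(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);
    __m128 m  = _mm_min_ps(v, hi);
    __m128 m1 = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1));
    m = _mm_min_ps(m, m1);
    return _mm_cvtss_f32(m);
}
```

Note that this works on floats (__m128), like the assembly above; it does not apply directly to the integers in a __m128i.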
There is no Horizontal Maximum opcode in SSE (at least up until the point where I stopped keeping track of new SSE instructions).
So you are stuck doing some shuffling. What you end up with is...
movhlps %xmm0, %xmm1 # Move top two floats to lower part of %xmm1
maxps %xmm1, %xmm0 # Get maximum of sets of two floats
pshufd $0x55, %xmm0, %xmm1 # Move second float to lower part of %xmm1
maxps %xmm1, %xmm0 # Get maximum of all four floats originally in %xmm0
http://locklessinc.com/articles/instruction_wishlist/
MSDN has the intrinsic and macro function mappings documented
http://msdn.microsoft.com/en-us/library/t467de55.aspx
If you find yourself needing to do horizontal operations on vectors, especially if it's inside an inner loop, then it's usually a sign that you are approaching your SIMD implementation in the wrong way. SIMD likes to operate element-wise on vectors - "vertically" if you like, not horizontally.
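To illustrate the "vertical" style, here is a rough sketch of finding the largest value in a float array: the inner loop only ever compares vectors lane by lane, and the single horizontal step happens once, after the loop. The function name array_max_ps is hypothetical, the array is assumed to hold at least one element, and unaligned loads are used so no alignment is assumed.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE: _mm_loadu_ps, _mm_max_ps, _mm_storeu_ps */

/* Maximum of n floats; assumes n >= 1 (hypothetical helper name). */
float array_max_ps(const float *data, size_t n) {
    float best = data[0];
    size_t i = 0;
    if (n >= 4) {
        /* Four running maxima, one per lane, updated vertically. */
        __m128 acc = _mm_loadu_ps(data);
        for (i = 4; i + 4 <= n; i += 4) {
            acc = _mm_max_ps(acc, _mm_loadu_ps(data + i));
        }
        /* One horizontal reduction, outside the hot loop. */
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        for (int k = 0; k < 4; ++k) {
            if (lanes[k] > best) best = lanes[k];
        }
    }
    /* Scalar tail for the remaining 0..3 elements (or all of them if n < 4). */
    for (; i < n; ++i) {
        if (data[i] > best) best = data[i];
    }
    return best;
}
```

The cost of the final reduction is paid once per array rather than once per vector, so how it is done matters much less.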
As for documentation, there is a very useful reference on intel.com which contains all the opcodes and intrinsics for everything from MMX through the various flavours of SSE all the way up to AVX and AVX-512.
In case anyone cares, and since intrinsics seem to be the way to go these days, here is a solution in terms of intrinsics.
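#include <smmintrin.h> /* SSE4.1, needed for _mm_max_epi32 */

int horizontal_max_Vec4i(__m128i x) {
    __m128i max1 = _mm_shuffle_epi32(x, _MM_SHUFFLE(0,0,3,2));    /* lanes 2,3 moved down to lanes 0,1 */
    __m128i max2 = _mm_max_epi32(x, max1);                        /* lane 0 = max(x0,x2), lane 1 = max(x1,x3) */
    __m128i max3 = _mm_shuffle_epi32(max2, _MM_SHUFFLE(0,0,0,1)); /* lane 1 moved down to lane 0 */
    __m128i max4 = _mm_max_epi32(max2, max3);                     /* lane 0 = max of all four */
    return _mm_cvtsi128_si32(max4);                               /* extract lane 0 */
}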
I don't know if that's any better than this:
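int horizontal_max_Vec4i(__m128i x) {
    /* 16-byte alignment is required by _mm_store_si128 (GCC/Clang attribute syntax) */
    int result[4] __attribute__((aligned(16))) = {0};
    _mm_store_si128((__m128i *) result, x);
    /* max is assumed to be a two-argument integer maximum, e.g. std::max in C++ */
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}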
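One caveat on the shuffle version: _mm_max_epi32 (pmaxsd) is an SSE4.1 instruction, so it is not available on SSE2-only targets. If that matters, a possible fallback is to build the per-lane maximum from a signed compare and a bitwise select. The sketch below assumes SSE2 only; the names max_epi32_sse2 and horizontal_max_Vec4i_sse2 are made up for illustration.

```c
#include <emmintrin.h> /* SSE2: _mm_cmpgt_epi32, _mm_shuffle_epi32, bitwise ops */

/* Per-lane signed 32-bit maximum using only SSE2
   (a stand-in for the SSE4.1 _mm_max_epi32). */
static __m128i max_epi32_sse2(__m128i a, __m128i b) {
    __m128i gt = _mm_cmpgt_epi32(a, b);           /* all-ones in lanes where a > b */
    return _mm_or_si128(_mm_and_si128(gt, a),     /* take a where a > b */
                        _mm_andnot_si128(gt, b)); /* take b elsewhere */
}

/* Same two-step reduction as the SSE4.1 version, with the emulated max. */
int horizontal_max_Vec4i_sse2(__m128i x) {
    __m128i hi = _mm_shuffle_epi32(x, _MM_SHUFFLE(0, 0, 3, 2)); /* lanes 2,3 -> lanes 0,1 */
    __m128i m  = max_epi32_sse2(x, hi);
    __m128i m1 = _mm_shuffle_epi32(m, _MM_SHUFFLE(0, 0, 0, 1)); /* lane 1 -> lane 0 */
    m = max_epi32_sse2(m, m1);
    return _mm_cvtsi128_si32(m);                                /* extract lane 0 */
}
```

For example, horizontal_max_Vec4i_sse2(_mm_setr_epi32(0, 1, 2, 3)) returns 3. Whether this beats the store-and-compare version would have to be measured on the target CPU.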