I have 16 byte 'strings' (they may be shorter but you may assume that they are padded with zeros at the end), but you may not assume they are 16 byte aligned (at least not always).
How do I write a routine that compares them (for equality) with SSE intrinsics? I found this code fragment that could be of help, but I'm not sure if it is appropriate.
register __m128i xmm0, xmm1;
register unsigned int eax;
xmm0 = _mm_load_epi128((__m128i*)(a));
xmm1 = _mm_load_epi128((__m128i*)(b));
xmm0 = _mm_cmpeq_epi8(xmm0, xmm1);
eax = _mm_movemask_epi8(xmm0);
if(eax==0xffff) //equal
else //not equal
Could someone explain this or write a function body?
It needs to work in GCC/mingw (on 32 bit Windows).
Vector comparison instructions produce their result as a mask whose elements are all-1s (true) or all-0s (false), according to the comparison between the corresponding source elements.
See https://stackoverflow.com/tags/x86/info for some links that will tell you what those intrinsics do.
Your code looks like it should work.
With SSE4.1 (for ptest) I might try:
__m128i avec, bvec;
avec = _mm_loadu_si128((__m128i*)(a));
bvec = _mm_loadu_si128((__m128i*)(b));
avec = _mm_xor_si128(avec, bvec); // XOR: all zero only if *a==*b
if(_mm_test_all_zeros(avec, avec)) //equal
else //not equal
Using ptest makes only a tiny difference in speed and code size compared to pcmp / movemask. In this case, ptest is actually slower. Stgatilov tested it. ptest is probably faster only if you don't need any extra instruction to build an input for it: test for all-zeros or not, with or without a mask. The negated 1st arg to set the carry flag is rarely useful.
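For reference, the ptest variant above could be wrapped into a function roughly like this. It is only a sketch: the name equal16_ptest is made up for illustration, and the target attribute is a GCC/Clang-specific assumption that lets this one function use SSE4.1 without building the whole file with -msse4.1.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_test_all_zeros (ptest) */

/* Hypothetical helper: returns 1 if the 16 bytes at a and b are identical,
   0 otherwise. Neither pointer needs to be 16-byte aligned. */
__attribute__((target("sse4.1")))
int equal16_ptest(const void *a, const void *b)
{
    __m128i avec = _mm_loadu_si128((const __m128i *)a);
    __m128i bvec = _mm_loadu_si128((const __m128i *)b);
    __m128i diff = _mm_xor_si128(avec, bvec);  /* all-zero only if equal */
    return _mm_test_all_zeros(diff, diff);     /* 1 if (diff & diff) == 0 */
}
```

The caller still has to make sure the CPU supports SSE4.1 before calling it.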
Also, if you want to find out which elements were non-equal, then use the movemask version. You can run lzcnt, popcnt, or other bit-count operations on the mask if it's not 0xffff.
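As an illustration of the bit-count idea, here is a minimal sketch that locates and counts the differing bytes. The helper names are hypothetical. Note that it uses a trailing-zero count rather than lzcnt, because bit 0 of the movemask result corresponds to byte 0; __builtin_ctz and __builtin_popcount are GCC/Clang builtins, so this assumes one of those compilers.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: bit i of the result is set if byte i differs. */
unsigned diff_mask16(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    unsigned eq = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
    return ~eq & 0xffffu;  /* invert: equal bytes become 0 bits */
}

/* Index (0..15) of the first differing byte, or -1 if all 16 are equal. */
int first_mismatch16(const void *a, const void *b)
{
    unsigned diff = diff_mask16(a, b);
    return diff ? __builtin_ctz(diff) : -1;  /* lowest set bit = first byte */
}

/* Number of byte positions that differ. */
int count_mismatch16(const void *a, const void *b)
{
    return __builtin_popcount(diff_mask16(a, b));
}
```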
Well, I'm not sure if this would be faster, but it can be done with a single SSE 4.2 instruction/intrinsic: checking PCMPISTRI (Packed Compare Implicit Length Strings, Return Index) for carry and/or overflow flags:
if (!_mm_cmpistrc(a, b, mode)) // checks the carry flag (not set = equal); a and b are __m128i values
// equal
else
// unequal
mode would be (for your case):
const int mode =
_SIDD_UBYTE_OPS | // 16 bytes per xmm
_SIDD_CMP_EQUAL_EACH | // strcmp
_SIDD_NEGATIVE_POLARITY; // find first different byte
Unfortunately this instruction is poorly documented. So if anyone finds a decent resource aggregating all combinations of mode and the resulting flags, please share.
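For what it's worth, here is how I would sketch the PCMPISTRI approach as a complete function. The name equal16_cmpistr is hypothetical, the target attribute is a GCC/Clang-specific assumption, and the constants are spelled with a leading underscore as in GCC's headers. Because the instruction works on implicit-length strings, everything after the first zero byte of each input is ignored, which fits zero-padded strings but not arbitrary binary data.

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_cmpistrc and the _SIDD_* constants */

/* Hypothetical helper: returns 1 if two zero-padded strings of at most
   16 bytes are equal, 0 otherwise. Pointers may be unaligned, but all
   16 bytes behind each pointer must be readable. */
__attribute__((target("sse4.2")))
int equal16_cmpistr(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* With negative polarity the carry flag is set when at least one
       position differs, so "carry clear" means "equal". */
    return !_mm_cmpistrc(va, vb,
                         _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                         _SIDD_NEGATIVE_POLARITY);
}
```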
I'll try to help with the forgotten "Could someone explain this" part of the question.
register __m128i xmm0, xmm1;
register unsigned int eax;
Here we declare some variables. __m128i is a builtin type for integer operations on SSE registers. Note that the names of the variables do not matter at all, but the author has named them exactly as the corresponding CPU registers are called in assembly. xmm0, xmm1, xmm2, xmm3, ... are all the registers for SSE operations. eax is one of the general-purpose registers.
The register keyword was used a long time ago to advise the compiler to place a variable in a CPU register. Today it is completely useless, I think. See this question for details.
xmm0 = _mm_loadu_si128((__m128i*)(a));
xmm1 = _mm_loadu_si128((__m128i*)(b));
This code was modified as @harold suggested. Here we load 16 bytes from the given memory pointers (which may be unaligned) into the variables xmm0 and xmm1. In assembly code these variables would most likely live directly in registers, so this intrinsic would generate an unaligned memory load. Converting the pointer to the __m128i* type is necessary because the intrinsic accepts this pointer type, though I have no idea why Intel did it.
xmm0 = _mm_cmpeq_epi8(xmm0, xmm1);
Here we compare each byte of the xmm0 variable for equality with the corresponding byte of the xmm1 variable. The suffix _epi8 means operating on 8-bit elements, i.e. bytes. It is somewhat similar to memcmp(&xmm0, &xmm1, 16), but generates a different result: it returns a 16-byte value that contains 0xFF for each byte with equal values and 0x00 for each byte with different values.
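To make the 0xFF / 0x00 pattern visible, the comparison result can be stored back to memory and inspected byte by byte. This is just a small sketch; byte_equal_mask16 is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: writes the per-byte result of _mm_cmpeq_epi8 to out,
   so out[i] is 0xFF where a[i] == b[i] and 0x00 where they differ. */
void byte_equal_mask16(const void *a, const void *b, uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_cmpeq_epi8(va, vb));
}
```

For example, comparing "0123456789abcdef" with "0123X56789abcdef" leaves 0x00 in out[4] and 0xFF everywhere else.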
eax = _mm_movemask_epi8(xmm0);
This is a very important instruction from SSE2, which is usually used to write an if statement with some SSE condition. It takes the highest bit from each of the 16 bytes in the XMM argument and writes them into a single 16-bit integer. At the assembly level, this number is located in a general-purpose register, allowing us to check its value quickly just afterwards.
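A tiny sketch of the bit layout may help here: bit i of the result is the top bit of byte i, with the byte at the lowest address ending up in bit 0. The helper name high_bits16 is made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: collects the top bit of each of the 16 bytes at p
   into the low 16 bits of the result (byte 0 -> bit 0, byte 15 -> bit 15). */
unsigned high_bits16(const uint8_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return (unsigned)_mm_movemask_epi8(v);
}
```

With the bytes 0x80, 0x7F, 0xFF followed by zeros, only bytes 0 and 2 have their top bit set, so the result is binary 101, i.e. 5.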
if(eax==0xffff) //equal
else //not equal
If all 16 bytes of the two XMM registers were equal, then _mm_cmpeq_epi8 must return a mask with all 128 bits set. _mm_movemask_epi8 would then return the full 16-bit mask, which is 0xFFFF. If any two compared bytes were different, the corresponding byte would be filled with zeros by _mm_cmpeq_epi8, so _mm_movemask_epi8 would return a 16-bit mask with the corresponding bit not set, which is less than 0xFFFF.
Also, here is the explained code wrapped into a function:
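#include <emmintrin.h> // SSE2 intrinsics
#include <stdbool.h>   // bool (when compiled as C)

bool AreEqual(const char *a, const char *b) {
    __m128i xmm0, xmm1;
    unsigned int eax;
    xmm0 = _mm_loadu_si128((const __m128i*)(a)); // unaligned 16-byte loads
    xmm1 = _mm_loadu_si128((const __m128i*)(b));
    xmm0 = _mm_cmpeq_epi8(xmm0, xmm1);           // 0xFF in every equal byte
    eax = _mm_movemask_epi8(xmm0);               // one bit per byte
    return (eax == 0xffff); //equal
}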
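A possible way to use AreEqual with shorter strings is sketched below. It relies on the question's assumption that every string occupies a full zero-padded 16-byte slot, since the loads always read 16 bytes. pad16 and padded_equal are hypothetical helpers, and AreEqual is repeated only so that the block compiles on its own.

```c
#include <emmintrin.h>
#include <stdbool.h>
#include <string.h>

/* Same comparison as AreEqual above, repeated for self-containment. */
bool AreEqual(const char *a, const char *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xffff;
}

/* Hypothetical helper: copies a NUL-terminated string (truncated to 16
   bytes) into a 16-byte slot and fills the rest with zeros. */
void pad16(char dst[16], const char *src)
{
    size_t n = strlen(src);
    if (n > 16)
        n = 16;
    memset(dst, 0, 16);
    memcpy(dst, src, n);
}

/* Hypothetical helper: pads both strings, then compares the two slots. */
bool padded_equal(const char *s, const char *t)
{
    char buf[1 + 16 + 16];  /* slots at offsets 1 and 17: alignment not assumed */
    pad16(buf + 1, s);
    pad16(buf + 17, t);
    return AreEqual(buf + 1, buf + 17);
}
```

Calling AreEqual directly on a short string literal would read past its end, which is why the padding step matters.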
Source: https://stackoverflow.com/questions/31999284/compare-16-byte-strings-with-sse