Fast counting the number of equal bytes between two arrays

匿名 (未验证) 提交于 2019-12-03 01:59:02

问题:

I wrote the function int compare_16bytes(__m128i lhs, __m128i rhs) in order to compare two 16 byte numbers using SSE instructions: this function returns how many bytes are equal after performing the comparison.

Now I would like use the above function in order to compare two byte arrays of arbitrary length: the length may not be a multiple of 16 bytes, so I need deal with this problem. How could I complete the implementation of the function below? How could I improve the function below?

int fast_compare(const char* s, const char* t, int length) {     int result = 0;      const char* sPtr = s;     const char* tPtr = t;      while(...)     {         const __m128i* lhs = (const __m128i*)sPtr;         const __m128i* rhs = (const __m128i*)tPtr;          // compare the next 16 bytes of s and t         result += compare_16bytes(*lhs,*rhs);          sPtr += 16;         tPtr += 16;     }      return result; } 

回答1:

As @Mysticial says in the comments above, do the compare and sum vertically and then just sum horizontally at the end of the main loop:

#include  #include  #include  #include   // reference implementation int fast_compare_ref(const char *s, const char *t, int length) {     int result = 0;     int i;      for (i = 0; i 

Compile and run the above test harness:

$ gcc -Wall -O3 -msse3 fast_compare.c -o fast_compare $ ./fast_compare result_ref = 3955, result = 3955 $ ./fast_compare result_ref = 3947, result = 3947 $ ./fast_compare result_ref = 3945, result = 3945 

Note that there is one possibly non-obvious trick in the above SSE code where we use _mm_madd_epi16 to unpack and accumulate 16 bit 0/-1 values to 32 bit partial sums. We take advantage of the fact that -1*-1 = 1 (and 0*0 = 0 of course) - we're not really doing a multiply here, just unpacking and summing in one instruction.


UPDATE: as noted in the comments below, this solution is not optimal - I just took a fairly optimal 16 bit solution and added 8 bit to 16 bit unpacking to make it work for 8 bit data. However for 8 bit data there are more efficient methods, e.g. using psadbw/_mm_sad_epu8. I'll leave this answer here for posterity, and for anyone who might want to do this kind of thing with 16 bit data, but really one of the other answers which doesn't require unpacking the input data should be the accepted answer.



回答2:

Using partial sums in 16 x uint8 elements may give even better performance.
I have divided the loop into inner loop and outer loop.
The inner loop sum uint8 elements (each uint8 element can sum up to 255 "1"s).
Small trick: _mm_cmpeq_epi8 set equal elements to 0xFF, and (char)0xFF = -1, so you can subtract the result from the sum (subtract -1 for adding 1).

Here is my optimized version for fast_compare:

int fast_compare2(const char *s, const char *t, int length) {     int result = 0;     int inner_length = length;     int i;     int j = 0;      //Points beginning of 4080 elements block.     const char *s0 = s;     const char *t0 = t;       __m128i vsum = _mm_setzero_si128();      //Outer loop sum result of 4080 sums.     for (i = 0; i 


回答3:

The fastest way for large inputs is Rotem's answer, where the inner loop is pcmpeqb / psubb, breaking out to horizontally sum before any byte element of the vector accumulator overflows. Do the hsum of unsigned bytes with psadbw against an all-zero vector.

Without unrolling / nested loops, the best option is probably

pcmpeqb   -> vector of  0  or  0xFF  elements psadbw    -> two 64bit sums of  (0*no_matches + 0xFF*matches) paddq     -> accumulate the psadbw result in a vector accumulator  #outside the loop: horizontal sum divide the result by 255 

If you don't have a lot of register pressure in your loop, psadbw against a vector of 0x7f instead of all-zero.

  • psadbw(0x00, set1(0x7f)) => sum += 0x7f
  • psadbw(0xff, set1(0x7f)) => sum += 0x80

So instead of dividing by 255 (which the compiler should do efficiently without an actual div), you just have to subtract n * 0x7f, where n is the number of elements.

Also note that paddq is slow on pre-Nehalem, and Atom, so you could use paddd (_mm_add_epi32) if you don't expect 128 * the count to ever overflow a 32bit integer.

This compares very well with the Paul R's pcmpeqb / 2x punpck / 2x pmaddwd / 2x paddw.



回答4:

The integer comparison in SSE produces bytes that either all zeros or all ones. If you want to count, you first need to right shift (not arithmetic) the comparison result by 7, then add to the result vector. At the end, you still need to reduce the result vector by summing its elements. This reduction has to be done in scalar code, or with a sequence of add/shifts. Usually this part is not worth troubling with.



标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!