Horizontal minimum and maximum using SSE

鱼传尺愫 2020-12-16 01:53

I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the ti

2 Answers
  •  再見小時候
    2020-12-16 02:25

    I suggest two changes:

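    • Replace ((int8_t*) ((void *) &buffer))[0] with _mm_cvtsi128_si32, which moves the low 32 bits of the vector directly into a general-purpose register; cast the result to int8_t to keep only the lowest byte.
    • Replace _mm_shuffle_epi8 with _mm_shuffle_epi32/_mm_shufflelo_epi16, which have lower latency on recent AMD processors and Intel Atom. They also take the shuffle pattern as an immediate, so no shuffle mask has to be loaded from memory: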

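      #include <smmintrin.h>  /* SSE4.1, needed for _mm_min_epi8 */
      #include <stdint.h>

      static inline int16_t hMin(__m128i buffer)
      {
          /* Fold the high 64 bits onto the low 64 bits: 16 candidates -> 8. */
          buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
          /* Fold dword 1 onto dword 0: 8 candidates -> 4. */
          buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
          /* Fold word 1 onto word 0: 4 candidates -> 2. */
          buffer = _mm_min_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
          /* Fold byte 1 onto byte 0; only byte 0 is meaningful afterwards. */
          buffer = _mm_min_epi8(buffer, _mm_srli_epi16(buffer, 8));
          /* Take the lowest byte, sign-extended to the return type. */
          return (int8_t)_mm_cvtsi128_si32(buffer);
      }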
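The answer only shows the minimum. A horizontal maximum can presumably be built the same way by swapping _mm_min_epi8 for _mm_max_epi8. The sketch below is an assumption along those lines, not code from the answer: hMax, hMaxScalar and the SKETCH_SSE41 macro are hypothetical names, and the target attribute is a GCC/Clang convenience that stands in for compiling with -msse4.1.

```c
#include <smmintrin.h>  /* SSE4.1, needed for _mm_max_epi8 */
#include <stdint.h>

/* Hypothetical helper macro: on GCC/Clang, enable SSE4.1 for one function
   so the file also builds without -msse4.1. Empty on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

/* Hypothetical counterpart of hMin: maximum of 16 signed bytes. */
SKETCH_SSE41 static inline int16_t hMax(__m128i buffer)
{
    /* Same folding steps as hMin: 16 -> 8 -> 4 -> 2 -> 1 candidates. */
    buffer = _mm_max_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
    buffer = _mm_max_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
    buffer = _mm_max_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
    /* The logical shift fills the high byte of each word with zero,
       but only byte 0 is read below, so that does not affect the result. */
    buffer = _mm_max_epi8(buffer, _mm_srli_epi16(buffer, 8));
    return (int8_t)_mm_cvtsi128_si32(buffer);
}

/* Scalar reference, useful for checking the vector version. */
static inline int16_t hMaxScalar(const int8_t v[16])
{
    int8_t best = v[0];
    for (int i = 1; i < 16; ++i) {
        if (v[i] > best) {
            best = v[i];
        }
    }
    return best;
}
```

To use it, load 16 bytes with _mm_loadu_si128((const __m128i *)ptr) and pass the result to hMax. Comparing against hMaxScalar on a few inputs, including all-negative ones, is a cheap way to confirm the shuffle constants.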
      
