Emulating shifts on 32 bytes with AVX

你说的曾经没有我的故事 提交于 2019-11-27 05:25:40

From different inputs, I gathered these solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8.

_mm256_slli_si256(A, N)

0 < N < 16

_mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 - N)

N = 16

_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0))

16 < N < 32

_mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), N - 16)

_mm256_srli_si256(A, N)

0 < N < 16

_mm256_alignr_epi8(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), A, N)

N = 16

_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1))

16 < N < 32

_mm256_srli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), N - 16)

Here is a function to bit shift left a ymm register using avx2. I use it to shift left by one, though it looks like it works for up to 63 bit shifts.

//----------------------------------------------------------------------------
// bit shift left a 256-bit value using ymm registers
//          __m256i *data - data to shift
//          int count     - number of bits to shift
// return:  __m256i       - carry out bit(s)

static __m256i bitShiftLeft256ymm (__m256i *data, int count)
   {
   __m256i innerCarry, carryOut, rotate;

   innerCarry = _mm256_srli_epi64 (*data, 64 - count);                        // carry outs in bit 0 of each qword
   rotate     = _mm256_permute4x64_epi64 (innerCarry, 0x93);                  // rotate ymm left 64 bits
   innerCarry = _mm256_blend_epi32 (_mm256_setzero_si256 (), rotate, 0xFC);   // clear lower qword
   *data      = _mm256_slli_epi64 (*data, count);                             // shift all qwords left
   *data      = _mm256_or_si256 (*data, innerCarry);                          // propagate carrys from low qwords
   carryOut   = _mm256_xor_si256 (innerCarry, rotate);                        // clear all except lower qword
   return carryOut;
   }

//----------------------------------------------------------------------------
Peter Cordes

If the shift count is a multiple of 4 bytes, vpermd (_mm256_permutevar8x32_epi32) with the right shuffle mask will do the trick with one instruction (or more, if you actually need to zero the shifted-in bytes instead of copying a different element over them).

To support variable (multiple-of-4B) shift counts, you could load the control mask from a window into an array of 0 0 0 0 0 0 0 1 2 3 4 5 6 7 0 0 0 0 0 0 0 or something, except that 0 is just the bottom element, and doesn't zero things out. For more on this idea for generating a mask from a sliding window, see my answer on another question.

This answer is pretty minimal, since vpermd doesn't directly solve the problem. I point it out as an alternative that might work in some cases where you're looking for a full vector shift.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!