_mm_alignr_epi8 (PALIGNR) equivalent in AVX2

百般思念 提交于 2019-12-04 06:21:58

What are you using palignr for? If it's only to handle data misalignment, simply use misaligned loads instead; they are generally "fast enough" on modern Intel µ-architectures (and will save you a lot of code size).

If you need palignr-like behavior for some other reason, you can simply take advantage of the unaligned load support to do it in a branch-free manner. Unless you're totally load-store bound, this is probably the preferred idiom.

static inline __m256i _mm256_alignr_epi8(const __m256i v0, const __m256i v1, const int n)
{
    // Do whatever your compiler needs to make this buffer 64-byte aligned.
    // You want to avoid the possibility of a page-boundary crossing load.
    char buffer[64];

    // Two aligned stores to fill the buffer.
    _mm256_store_si256((__m256i *)&buffer[0], v0);
    _mm256_store_si256((__m256i *)&buffer[32], v1);

    // Misaligned load to get the data we want.
    return _mm256_loadu_si256((__m256i *)&buffer[n]);
}

If you can provide more information about how exactly you're using palignr, I can probably be more helpful.

We need 2 instructions: “vperm2i128” and “vpalignr” to extend “palignr” on 256 bits.

See: https://software.intel.com/en-us/blogs/2015/01/13/programming-using-avx2-permutations

The only solution I was able to come up with for this is:

static inline __m256i _mm256_alignr_epi8(const __m256i v0, const __m256i v1, const int n)
{
  if (n < 16)
  {
    __m128i v0h = _mm256_extractf128_si256(v0, 0);
    __m128i v0l = _mm256_extractf128_si256(v0, 1);
    __m128i v1h = _mm256_extractf128_si256(v1, 0);
    __m128i vouth = _mm_alignr_epi8(v0l, v0h, n);
    __m128i voutl = _mm_alignr_epi8(v1h, v0l, n);
    __m256i vout = _mm256_set_m128i(voutl, vouth);
    return vout;
  }
  else
  {
    __m128i v0h = _mm256_extractf128_si256(v0, 1);
    __m128i v0l = _mm256_extractf128_si256(v1, 0);
    __m128i v1h = _mm256_extractf128_si256(v1, 1);
    __m128i vouth = _mm_alignr_epi8(v0l, v0h, n - 16);
    __m128i voutl = _mm_alignr_epi8(v1h, v0l, n - 16);
    __m256i vout = _mm256_set_m128i(voutl, vouth);
    return vout;
  }
}

which I think is pretty much identical to your solution except it also handles shifts of >= 16 bytes.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!