Question
In SSSE3, the PALIGNR instruction performs the following:
PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination.
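To make the byte shuffling concrete, here is a small sketch (my own test values, not part of the original question) of the 128-bit _mm_alignr_epi8 acting on a numbered byte stream:

#include <immintrin.h>  // SSSE3 intrinsics; compile with -mssse3 or later
#include <stdio.h>

int main(void)
{
    // lo holds bytes 0..15 and hi holds bytes 16..31 of a 32-byte stream.
    __m128i lo = _mm_setr_epi8( 0,  1,  2,  3,  4,  5,  6,  7,
                                8,  9, 10, 11, 12, 13, 14, 15);
    __m128i hi = _mm_setr_epi8(16, 17, 18, 19, 20, 21, 22, 23,
                               24, 25, 26, 27, 28, 29, 30, 31);

    // CONCAT(hi, lo) is shifted right by 3 bytes and the low 16 bytes are
    // kept, so the result is bytes 3..18 of the stream.
    __m128i r = _mm_alignr_epi8(hi, lo, 3);

    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, r);
    for (int i = 0; i < 16; i++)
        printf("%u ", out[i]);      // prints 3 4 5 ... 18
    printf("\n");
    return 0;
}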
I'm currently porting my SSE4 code to AVX2 and working on 256-bit registers instead of 128-bit ones. Naively, I believed that the intrinsic _mm256_alignr_epi8 (VPALIGNR) performs the same operation as _mm_alignr_epi8, only on 256-bit registers. Sadly, that is not exactly the case. In fact, _mm256_alignr_epi8 treats the 256-bit register as two independent 128-bit lanes and performs an "align" operation within each lane, effectively performing the same operation as _mm_alignr_epi8, but on two registers at once. This is most clearly illustrated in the documentation for _mm256_alignr_epi8.
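The following sketch (again my own test values) demonstrates the per-lane behavior: the result is not a single 32-byte shift across the whole register.

#include <immintrin.h>  // AVX2 intrinsics; compile with -mavx2
#include <stdio.h>

int main(void)
{
    // a = bytes 0..31 and b = bytes 32..63 of a conceptual 64-byte stream.
    unsigned char a_bytes[32], b_bytes[32], out[32];
    for (int i = 0; i < 32; i++) {
        a_bytes[i] = (unsigned char)i;
        b_bytes[i] = (unsigned char)(32 + i);
    }

    __m256i a = _mm256_loadu_si256((const __m256i *)a_bytes);
    __m256i b = _mm256_loadu_si256((const __m256i *)b_bytes);

    // Per 128-bit lane k: result lane k = _mm_alignr_epi8(a lane k, b lane k, 1).
    __m256i r = _mm256_alignr_epi8(a, b, 1);
    _mm256_storeu_si256((__m256i *)out, r);

    for (int i = 0; i < 32; i++)
        printf("%u ", out[i]);
    printf("\n");
    // Prints 33..47 0 49..63 16: two independent 16-byte alignrs,
    // not one 32-byte shift.
    return 0;
}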
Currently my solution is to keep using _mm_alignr_epi8 by splitting the ymm (256-bit) registers into two xmm (128-bit) registers (high and low), like so:
// Index 0 extracts the low 128-bit lane, index 1 the high lane.
__m128i xmm_ymm1_lo = _mm256_extractf128_si256(ymm1, 0);
__m128i xmm_ymm1_hi = _mm256_extractf128_si256(ymm1, 1);
__m128i xmm_ymm2_lo = _mm256_extractf128_si256(ymm2, 0);
// 128-bit alignr on each pair of neighboring lanes.
__m128i xmm_ymm_aligned_lo = _mm_alignr_epi8(xmm_ymm1_hi, xmm_ymm1_lo, 1);
__m128i xmm_ymm_aligned_hi = _mm_alignr_epi8(xmm_ymm2_lo, xmm_ymm1_hi, 1);
// _mm256_set_m128i(high, low) recombines the two lanes.
__m256i ymm_aligned = _mm256_set_m128i(xmm_ymm_aligned_hi, xmm_ymm_aligned_lo);
This works, but there has to be a better way, right? Is there perhaps a more "general" AVX2 instruction that I should be using to get the same result?
Answer 1:
What are you using palignr for? If it's only to handle data misalignment, simply use misaligned loads instead; they are generally "fast enough" on modern Intel µ-architectures (and will save you a lot of code size).
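For example, assuming the data comes straight from memory (the question doesn't say, so this is only a sketch with a made-up helper name), the old load-load-palignr pattern collapses into a single unaligned load at the offset you actually want:

#include <immintrin.h>
#include <stddef.h>

// Hypothetical helper: return the 32 bytes starting at src + offset.
// On modern Intel cores an unaligned load that stays within a cache line
// costs roughly the same as an aligned one.
static inline __m256i load_window(const unsigned char *src, size_t offset)
{
    return _mm256_loadu_si256((const __m256i *)(src + offset));
}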
If you need palignr-like behavior for some other reason, you can simply take advantage of the unaligned load support to do it in a branch-free manner. Unless you're totally load-store bound, this is probably the preferred idiom.
static inline __m256i _mm256_alignr_epi8(const __m256i v0, const __m256i v1, const int n)
{
    // 64-byte alignment keeps both stores aligned and avoids any chance of the
    // load crossing a page boundary.  alignas is C++11 (or C11 with <stdalign.h>);
    // use your compiler's alignment attribute if you need something older.
    alignas(64) char buffer[64];
    // Two aligned stores to fill the buffer.
    _mm256_store_si256((__m256i *)&buffer[0], v0);
    _mm256_store_si256((__m256i *)&buffer[32], v1);
    // Misaligned load to get the data we want (n must be in [0, 32]).
    return _mm256_loadu_si256((__m256i *)&buffer[n]);
}
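A usage sketch (mine, not the answerer's; the wrapper is renamed to alignr256_mem, a made-up name, so it cannot clash with the real _mm256_alignr_epi8 declared in immintrin.h):

#include <immintrin.h>
#include <stdalign.h>
#include <stdio.h>

static inline __m256i alignr256_mem(__m256i v0, __m256i v1, int n)
{
    alignas(64) char buffer[64];
    _mm256_store_si256((__m256i *)&buffer[0], v0);
    _mm256_store_si256((__m256i *)&buffer[32], v1);
    return _mm256_loadu_si256((__m256i *)&buffer[n]);
}

int main(void)
{
    unsigned char data[64], out[32];
    for (int i = 0; i < 64; i++) data[i] = (unsigned char)i;

    __m256i v0 = _mm256_loadu_si256((const __m256i *)&data[0]);   // bytes 0..31
    __m256i v1 = _mm256_loadu_si256((const __m256i *)&data[32]);  // bytes 32..63

    // v0 sits at the low addresses of the buffer, so the result is bytes 5..36.
    __m256i r = alignr256_mem(v0, v1, 5);
    _mm256_storeu_si256((__m256i *)out, r);
    printf("%u %u\n", out[0], out[31]);   // 5 36
    return 0;
}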
If you can provide more information about how exactly you're using palignr, I can probably be more helpful.
Answer 2:
We need two instructions, vperm2i128 and vpalignr, to extend palignr to 256 bits.
See: https://software.intel.com/en-us/blogs/2015/01/13/programming-using-avx2-permutations
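A sketch of the idea (my own reading of it, not code from the linked article): for byte shifts of 1..15, one vperm2i128 builds the cross-lane neighbor that each 128-bit alignr needs, and a single vpalignr then produces the full 256-bit result. N must be a compile-time constant because both instructions take an immediate.

#include <immintrin.h>

// Treat [lo = bytes 0..31, hi = bytes 32..63] as one 64-byte stream and
// return bytes N..N+31, i.e. a true 256-bit alignr, for N in 1..15.
// _mm256_permute2x128_si256(lo, hi, 0x21) yields [lo.high_lane, hi.low_lane],
// which supplies the bytes that shift into each lane.
#define ALIGNR256(hi, lo, N) \
    _mm256_alignr_epi8(_mm256_permute2x128_si256((lo), (hi), 0x21), (lo), (N))

// For N in 16..31 the same two instructions work with the operands swapped.
#define ALIGNR256_HIGH(hi, lo, N) \
    _mm256_alignr_epi8((hi), _mm256_permute2x128_si256((lo), (hi), 0x21), (N) - 16)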
Answer 3:
The only solution I was able to come up with for this is:
static inline __m256i _mm256_alignr_epi8(const __m256i v0, const __m256i v1, const int n)
{
    // n must be a compile-time constant: _mm_alignr_epi8 takes an immediate.
    if (n < 16)
    {
        // Extract index 0 = low 128-bit lane, index 1 = high lane.
        __m128i v0_lo = _mm256_extractf128_si256(v0, 0);
        __m128i v0_hi = _mm256_extractf128_si256(v0, 1);
        __m128i v1_lo = _mm256_extractf128_si256(v1, 0);
        // Result low lane = bytes n..n+15, high lane = bytes 16+n..31+n
        // of the 64-byte stream [v0, v1].
        __m128i out_lo = _mm_alignr_epi8(v0_hi, v0_lo, n);
        __m128i out_hi = _mm_alignr_epi8(v1_lo, v0_hi, n);
        return _mm256_set_m128i(out_hi, out_lo);   // (high, low)
    }
    else
    {
        __m128i v0_hi = _mm256_extractf128_si256(v0, 1);
        __m128i v1_lo = _mm256_extractf128_si256(v1, 0);
        __m128i v1_hi = _mm256_extractf128_si256(v1, 1);
        __m128i out_lo = _mm_alignr_epi8(v1_lo, v0_hi, n - 16);
        __m128i out_hi = _mm_alignr_epi8(v1_hi, v1_lo, n - 16);
        return _mm256_set_m128i(out_hi, out_lo);
    }
}
which I think is pretty much identical to your solution except it also handles shifts of >= 16 bytes.
Source: https://stackoverflow.com/questions/8517970/mm-alignr-epi8-palignr-equivalent-in-avx2