Emulating shifts on 32 bytes with AVX
Question: I am migrating vectorized code written with SSE2 intrinsics to AVX2 intrinsics. Much to my disappointment, I discovered that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate on the two 128-bit halves of the AVX register separately: zeros are shifted in at the boundary of each half instead of bytes being carried across it. (This is in contrast to _mm_slli_si128 and _mm_srli_si128, which shift a whole SSE register.) Can you recommend a short substitute?

UPDATE: a full-width equivalent of _mm256_slli_si256 can be built efficiently with _mm256_alignr_epi8.
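To make the update concrete, here is a minimal sketch of one way such a full-width shift might be written. It rests on assumptions not stated in the question: _mm256_alignr_epi8 is paired with _mm256_permute2x128_si256 to bring the neighbouring 128-bit half into position, the shift count N is a compile-time constant with 0 < N < 16, and the names SHL256_BYTES, SHR256_BYTES, AVX2_FN, shl32_by4 and shr32_by4 are hypothetical helpers chosen for illustration.

```c
#include <immintrin.h>
#include <stdint.h>

/* GCC/Clang only: lets the functions below compile without -mavx2.
   Elsewhere, enable AVX2 through the compiler's own options. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Whole-register left shift by N bytes, 0 < N < 16 (N must be a constant).
   The permute builds [high = low half of a, low = zero]; alignr then pulls
   the top N bytes of that neighbour into the bottom of each half.
   Note: the macro argument a is evaluated twice. */
#define SHL256_BYTES(a, N) \
    _mm256_alignr_epi8((a), _mm256_permute2x128_si256((a), (a), 0x08), 16 - (N))

/* Whole-register right shift by N bytes, 0 < N < 16.
   The permute builds [high = zero, low = high half of a]. */
#define SHR256_BYTES(a, N) \
    _mm256_alignr_epi8(_mm256_permute2x128_si256((a), (a), 0x81), (a), (N))

/* Demo wrappers with a fixed shift count of 4 bytes. */
AVX2_FN void shl32_by4(const uint8_t in[32], uint8_t out[32]) {
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, SHL256_BYTES(v, 4));
}

AVX2_FN void shr32_by4(const uint8_t in[32], uint8_t out[32]) {
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, SHR256_BYTES(v, 4));
}
```

With this formulation, byte i of the input ends up at byte i + 4 (left shift) or i - 4 (right shift) across the full 32 bytes, with zeros filling the vacated positions. Shift counts of 16 or more need a different combination (for example, the permute followed by a per-half shift by N - 16), which this sketch does not cover.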