I want to shift SSE/AVX registers multiples of 32 bits left or right while shifting in zeros.
Let me be more precise on the shifts I\'m interested in. For SSE I wan
Your SSE implementation is fine but I suggest you use the _mm_slli_si128 implementation for both of the shifts - the casts make it look complicated but it really boils down to just one instruction for each shift.
Your AVX2 implementation won't work unfortunately. Almost all AVX instructions are effectively just two SSE instructions in parallel operating on two adjacent 128 bit lanes. So for your first shift_AVX2 example you'd get:
0, 0, 1, 2, 0, 4, 5, 6
----------- ----------
LS lane MS lane
All is not lost however: one of the few instructions which does work across lanes on AVX is _mm256_permutevar8x32_ps. Note that you'll need to use an _mm256_and_ps in conjunction with this to zero the shifted in elements. Note also that this is an AVX2 solution — AVX on its own is very limited for anything other than basic arithmetic/logic operations so I think you'll have a hard time doing this efficiently without AVX2.
You can do a shift right with _mm256_permute_ps, _mm256_permute2f128_ps, and _mm256_blend_ps as follows:
__m256 t0 = _mm256_permute_ps(x, 0x39); // [x4 x7 x6 x5 x0 x3 x2 x1]
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 0x81); // [ 0 0 0 0 x4 x7 x6 x5]
__m256 y = _mm256_blend_ps(t0, t1, 0x88); // [ 0 x7 x6 x5 x4 x3 x2 x1]
The result is in y. In order to do a rotate right, set the permute mask to 0x01 instead of 0x81. Shift/rotate left and larger shifts/rotates can be done similarly by changing the permute and blend control bytes.