I'm working on a port of SSE2 to NEON. The port is at an early stage and it's producing incorrect results. Part of the reason for the incorrect results is _mm_shuffle_epi32 and the NEON instructions I selected.
The documentation for _mm_shuffle_epi32 from Microsoft is on the lean side. The Intel documentation is better, but it's not clear to me what some of the pseudo-code is doing:
SELECT4(src, control)
{
    CASE(control[1:0])
    0: tmp[31:0] := src[31:0]
    1: tmp[31:0] := src[63:32]
    2: tmp[31:0] := src[95:64]
    3: tmp[31:0] := src[127:96]
    ESAC
    RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm8[1:0])
dst[63:32] := SELECT4(a[127:0], imm8[3:2])
dst[95:64] := SELECT4(a[127:0], imm8[5:4])
dst[127:96] := SELECT4(a[127:0], imm8[7:6])
I need help envisioning what _mm_shuffle_epi32 does or, more precisely, the permutation that the immediate applies to the value. I guess I need to see it as basic C with ANDs and ORs.
Given C statements and macros like:
v2 = _mm_shuffle_epi32(v1, _MM_SHUFFLE(i1,i2,i3,i4));
What does the resulting C expression look like when it's unrolled into basic C statements?
There's no AND/OR going on, unless you need to unpack the 8-bit integer holding four 2-bit indices. Make your own definition for _MM_SHUFFLE that expands to four args, instead of packing them.
It's something like
// dst = _mm_shuffle_epi32(src, _MM_SHUFFLE(d,c,b,a))
void pshufd(int dst[4], int src[4], int d, int c, int b, int a)
{   // note that the _MM_SHUFFLE args are high-element-first order
    dst[0] = src[a];
    dst[1] = src[b];
    dst[2] = src[c];
    dst[3] = src[d];
}
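If you do need to work from the packed immediate, the shifts and ANDs only extract the four 2-bit indices. Here's a minimal plain-C sketch of that, following the Intel pseudo-code quoted in the question. The names shuffle_epi32_model and MODEL_SHUFFLE are made up for illustration and are not real intrinsics or macros. MODEL_SHUFFLE is assumed to pack its args the same way _MM_SHUFFLE does.

```c
#include <stdint.h>

/* Hypothetical plain-C model of _mm_shuffle_epi32 taking the packed 8-bit
   immediate. Element 0 is the low element (lowest memory address). */
static void shuffle_epi32_model(int32_t dst[4], const int32_t src[4], unsigned imm8)
{
    int32_t tmp[4];                              /* temp so dst may alias src */
    for (int i = 0; i < 4; i++) {
        unsigned idx = (imm8 >> (2 * i)) & 3u;   /* 2-bit selector for dst[i] */
        tmp[i] = src[idx];
    }
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}

/* Assumed to match _MM_SHUFFLE(d, c, b, a): d goes to bits 7:6, a to bits 1:0. */
#define MODEL_SHUFFLE(d, c, b, a) (((d) << 6) | ((c) << 4) | ((b) << 2) | (a))
```

Calling shuffle_epi32_model(dst, src, MODEL_SHUFFLE(2, 3, 0, 1)) with src = {10, 11, 12, 13} should leave dst = {11, 10, 13, 12}, i.e. each pair of adjacent elements swapped.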
Vectors are indexed from low element = 0. The low element is the one that is stored to memory at the lowest address, but when values are in registers you should think about them as [ 3 2 1 0 ]. In this notation, vector right-shifts (like psrldq) actually shift to the right.
This is why _mm_set_epi32(3, 2, 1, 0) takes its args in reverse order from int foo[] = { 0, 1, 2, 3 };.
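To make the ordering concrete, here's a rough plain-C model of the two points above. Both function names are hypothetical, and the shift is restricted to whole 32-bit elements for simplicity (the real psrldq shifts by bytes).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of _mm_set_epi32(e3, e2, e1, e0): the args are written
   high-element-first, but e0 is the element at the lowest memory address. */
static void set_epi32_model(int32_t v[4], int32_t e3, int32_t e2, int32_t e1, int32_t e0)
{
    v[0] = e0;
    v[1] = e1;
    v[2] = e2;
    v[3] = e3;
}

/* Hypothetical model of psrldq by n whole elements (like _mm_srli_si128(v, 4 * n)):
   elements move toward index 0 and zeros shift in at the top. */
static void srli_elems_model(int32_t dst[4], const int32_t src[4], unsigned n)
{
    int32_t tmp[4] = {0, 0, 0, 0};
    for (unsigned i = 0; i + n < 4; i++)
        tmp[i] = src[i + n];
    memcpy(dst, tmp, sizeof tmp);   /* copy via tmp so dst may alias src */
}
```

After set_epi32_model(v, 3, 2, 1, 0), the array holds {0, 1, 2, 3} in memory order. Shifting by one element gives {1, 2, 3, 0}, which written in the [ 3 2 1 0 ] notation is [ 0 3 2 1 ]: everything moved one place to the right.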
When it's not clear what exactly some intrinsic is doing, a few sample runs with simple inputs might help as well:
// requires <stdio.h> and <emmintrin.h> (SSE2)
int x[] = {0,1,2,3}, y[4];
__m128i s = _mm_shuffle_epi32(_mm_loadu_si128((const __m128i*)x), _MM_SHUFFLE(2, 3, 0, 1));
_mm_storeu_si128((__m128i*)y, s);  // unaligned store: y is not guaranteed to be 16-byte aligned
printf("{%d,%d,%d,%d} => {%d,%d,%d,%d}\n", x[0], x[1], x[2], x[3], y[0], y[1], y[2], y[3]);
{0,1,2,3} => {1,0,3,2}
Source: https://stackoverflow.com/questions/37084379/convert-mm-shuffle-epi32-to-c-expression-for-the-permutation