I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of numbe
You want portability and efficiency, but you can't have it both ways. You said you want to do this with the fewest instructions. Once the block is in a register, it can be transposed with a single instruction: pshufb from the x86 instruction set, which requires SSSE3 (see below).
Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.
The SSE equivalent of _MM_TRANSPOSE4_PS for bytes is to use _mm_shuffle_epi8 (the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.
//use -mssse3 with GCC or /arch:SSE2 with MSVC
#include <stdio.h>
#include <tmmintrin.h> //SSSE3
int main() {
    //row-major 4x4 block holding the bytes 0..15
    char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15};
    //mask byte i selects the source byte that lands in output position i
    __m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);
    __m128i v = _mm_loadu_si128((__m128i*)x);
    v = _mm_shuffle_epi8(v,mask);
    _mm_storeu_si128((__m128i*)x,v);
    for(int i=0; i<16; i++) printf("%d ", x[i]);
    printf("\n");
    //output: 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
}
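For the portability side of the trade-off, here is a rough sketch of where the mask values come from and what a plain C fallback could look like. The function names (transpose4x4_scalar, build_transpose_mask, shuffle_bytes_scalar) are made up for illustration and are not part of any library. The scalar shuffle only models pshufb for mask bytes in the range 0..15; it does not model the zeroing that the real instruction performs when the high bit of a mask byte is set.

```c
#include <assert.h>
#include <stddef.h>

/* Portable reference transpose of a row-major 4x4 byte block:
   the byte at (row r, column c) moves to (row c, column r). */
void transpose4x4_scalar(const unsigned char in[16], unsigned char out[16])
{
    for (size_t r = 0; r < 4; r++)
        for (size_t c = 0; c < 4; c++)
            out[4 * c + r] = in[4 * r + c];
}

/* Output byte i sits at (row i/4, column i%4), so it must come from
   source index 4*(i%4) + i/4. This yields 0,4,8,12, 1,5,9,13, ... */
void build_transpose_mask(unsigned char mask[16])
{
    for (size_t i = 0; i < 16; i++)
        mask[i] = (unsigned char)(4 * (i % 4) + i / 4);
}

/* Scalar model of a byte shuffle: out[i] = in[mask[i]].
   Assumes every mask byte is below 16 (hypothetical helper). */
void shuffle_bytes_scalar(const unsigned char in[16],
                          const unsigned char mask[16],
                          unsigned char out[16])
{
    for (size_t i = 0; i < 16; i++) {
        assert(mask[i] < 16); /* high-bit zeroing is not modeled here */
        out[i] = in[mask[i]];
    }
}
```

Shuffling with the generated mask gives the same result as the scalar transpose, which is a handy way to sanity-check a hand-written mask before hard-coding it into _mm_setr_epi8.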