Fastest way to transpose 4x4 byte matrix

猫巷女王i 2020-12-19 05:51

I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of numbe

5 answers
  •  眼角桃花
    2020-12-19 06:32

    You want portability and efficiency, but you can't have it both ways. You said you want to do this with the fewest instructions. On x86 it can be done with a single instruction: pshufb, which is part of SSSE3 (see below).

    Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.

    The SSE equivalent of _MM_TRANSPOSE4_PS for bytes is to use _mm_shuffle_epi8 (the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.

    //use -mssse3 with GCC or /arch:SSE2 with MSVC
    #include <stdio.h>
    #include <tmmintrin.h> //SSSE3
    int main() {
        //4x4 byte matrix stored row by row
        char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15};
        //mask byte i selects the source byte that ends up at position i
        __m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);

        __m128i v = _mm_loadu_si128((__m128i*)x);
        v = _mm_shuffle_epi8(v,mask);
        _mm_storeu_si128((__m128i*)x,v);
        for(int i=0; i<16; i++) printf("%d ", x[i]);
        printf("\n");
        //output: 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
    }
    
