Fastest way to transpose 4x4 byte matrix

前端 未结 5 1998
猫巷女王i
猫巷女王i 2020-12-19 05:51

I have a 4x4 block of bytes that I\'d like to transpose using general purpose hardware. In other words, for bytes A-P, I\'m looking for the most efficient (in terms of numbe

相关标签:
5条回答
  • 2020-12-19 06:15

    An efficient solution is possible on a 64 bits machine, if you accept that. First shift the 32 bits integer constants by (0,) 1, 2 and 3 bytes respectively [3 shitfs]. Then mask out the unwanted bits and perform logical ORs [12 ANDs with a constant, 12 ORs]. Finally, shift back to 32 bits [3 shifts] and read out the 32 bits.

    ABCD
    EFGH
    IJKL
    MNOP
    
    ABCD
     EFGH
      IJKL
       MNOP
    
    A---
     E---
      I---
       MNOP
    =======
    AEIMNOP
    AEIM
    
    AB--
     -F--
      -J--
       -NOP
    =======
    ABFJNOP
    BFJN
    
    ABC-
     --G-
      --K-
       --OP
    =======
    ABCGKOP
    CGKO
    
    ABCD
     ---H
      ---L
       ---P
    =======
    ABCDHLP
    DHLP
    
    0 讨论(0)
  • 2020-12-19 06:23

    Not sure about the speed but these are okay.

    template<typename T, std::size_t Size>
    void Transpose(T (&Data)[Size][Size])
    {
        for (int I = 0; I < Size; ++I)
        {
            for (int J = 0; J < I; ++J)
            {
                std::swap(Data[I][J], Data[J][I]);
            }
        }
    }
    
    template<typename T, std::size_t Size>
    void Transpose(T (&Data)[Size * Size])
    {
        for (int I = 0; I < Size; ++I)
        {
            for (int J = 0; J < I; ++J)
            {
                std::swap(Data[I * Size + J], Data[J * Size + I]);
            }
        }
    }
    
    0 讨论(0)
  • 2020-12-19 06:24

    I posted an answer for this same problem awhile back ago for SSE here.

    The only things that need to be added are vectorized load/store operations.

    This answer is similar to Z boson's answer to this question. Examples for load/store can be seen there. This answer differs because in addition to the SSE3 implementation there is an SSE2 implementation which is guaranteed to run on any x64 processor.

    It's worth noting that both of these solutions assume that the entire matrix is row major in memory, but OP's question states that each row could have it's own pointer which implies the array could be fragmented.

    0 讨论(0)
  • 2020-12-19 06:28

    Let me rephrase your question: you're asking for a C- or C++-only solution that is portable. Then:

    void transpose(uint32_t const in[4], uint32_t out[4]) {
      // A B C D    A E I M
      // E F G H    B F J N
      // I J K L    C G K O
      // M N O P    D H L P
    
      out[0] = in[0] & 0xFF000000U; // A . . .
      out[1] = in[1] & 0x00FF0000U; // . F . .
      out[2] = in[2] & 0x0000FF00U; // . . K .
      out[3] = in[3] & 0x000000FFU; // . . . P
    
      out[1] |= (in[0] <<  8) & 0xFF000000U; // B F . .
      out[2] |= (in[0] << 16) & 0xFF000000U; // C . K .
      out[3] |= (in[0] << 24);               // D . . P
    
      out[0] |= (in[1] >>  8) & 0x00FF0000U; // A E . .
      out[2] |= (in[1] <<  8) & 0x00FF0000U; // C G K .
      out[3] |= (in[1] << 16) & 0x00FF0000U; // D H . P
    
      out[0] |= (in[2] >> 16) & 0x0000FF00U; // A E I .
      out[1] |= (in[2] >>  8) & 0x0000FF00U; // B F J .
      out[3] |= (in[2] <<  8) & 0x0000FF00U; // D H L P
    
      out[0] |= (in[3] >> 24);               // A E I M
      out[1] |= (in[3] >>  8) & 0x000000FFU; // B F J N
      out[2] |= (in[3] <<  8) & 0x000000FFU; // C G K O
    }
    

    I don't see how it could be answered any other way, since then you'd be depending on a particular compiler compiling it in a particular way, etc.

    Of course if those manipulations themselves can be somehow simplified, it'd help. So that's the only avenue of further pursuit here. Nothing stands out so far, but then it's been a long day for me.

    So far, the cost is 12 shifts, 12 ORs, 16 ANDs. If the compiler and platform are any good, it can be done in 9 32 bit registers.

    If the compiler is very sad, or the platform doesn't have a barrel shifter, then some casting could help extol the fact that the shifts and masks are just byte extractions:

    void transpose(uint8_t const in[16], uint8_t out[16]) {
      // A B C D    A E I M
      // E F G H    B F J N
      // I J K L    C G K O
      // M N O P    D H L P
    
      out[0]  = in[0];  // A . . .
      out[1]  = in[4];  // A E . .
      out[2]  = in[8];  // A E I .
      out[3]  = in[12]; // A E I M
      out[4]  = in[1];  // B . . .
      out[5]  = in[5];  // B F . .
      out[6]  = in[9];  // B F J .
      out[7]  = in[13]; // B F J N
      out[8]  = in[2];  // C . . .
      out[9]  = in[6];  // C G . .
      out[10] = in[10]; // C G K .
      out[11] = in[14]; // C G K O
      out[12] = in[3];  // D . . .
      out[13] = in[7];  // D H . .
      out[14] = in[11]; // D H L .
      out[15] = in[15]; // D H L P
    }
    

    If you really want to shuffle it in-place, then the following would do.

    void transpose(uint8_t m[16]) {
      std::swap(m[1], m[4]);
      std::swap(m[2], m[8]);
      std::swap(m[3], m[12]);
      std::swap(m[6], m[9]);
      std::swap(m[7], m[13]);
      std::swap(m[11], m[14]);
    }
    

    The byte-oriented versions may well produce worse code on modern platforms. Only a benchmark can tell.

    0 讨论(0)
  • 2020-12-19 06:32

    You want potability and efficiency. Well you can't have it both ways. You said you want to do this with the fewest number of instructions. Well it's possible to do this with only one instruction with SSE3 using the pshufb instruction (see below) from the x86 instruction set.

    Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.

    The SSE equivalent of _MM_TRANSPOSE4_PS for bytes is to use _mm_shuffle_epi8 (the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.

    //use -msse3 with GCC or /arch:SSE2 with MSVC
    #include <stdio.h>
    #include <tmmintrin.h> //SSSE3
    int main() {
        char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,15,16};
        __m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);
    
        __m128i v = _mm_loadu_si128((__m128i*)x);
        v = _mm_shuffle_epi8(v,mask);
        _mm_storeu_si128((__m128i*)x,v);
        for(int i=0; i<16; i++) printf("%d ", x[i]); printf("\n");
        //output: 0 4 8 12 1 5 9 13 2 6 10 15 3 7 11 16   
    }
    
    0 讨论(0)
提交回复
热议问题