Fastest way to transpose 4x4 byte matrix

猫巷女王i 2020-12-19 05:51

I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of numbe

5 Answers
  •  北海茫月
    2020-12-19 06:28

    Let me rephrase your question: you're asking for a C- or C++-only solution that is portable. Then:

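    #include <cstdint>

    // Each row is packed into one word, leftmost byte most significant
    // (A is the top byte of in[0]).
    void transpose(uint32_t const in[4], uint32_t out[4]) {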
      // A B C D    A E I M
      // E F G H    B F J N
      // I J K L    C G K O
      // M N O P    D H L P
    
      out[0] = in[0] & 0xFF000000U; // A . . .
      out[1] = in[1] & 0x00FF0000U; // . F . .
      out[2] = in[2] & 0x0000FF00U; // . . K .
      out[3] = in[3] & 0x000000FFU; // . . . P
    
      out[1] |= (in[0] <<  8) & 0xFF000000U; // B F . .
      out[2] |= (in[0] << 16) & 0xFF000000U; // C . K .
      out[3] |= (in[0] << 24);               // D . . P
    
      out[0] |= (in[1] >>  8) & 0x00FF0000U; // A E . .
      out[2] |= (in[1] <<  8) & 0x00FF0000U; // C G K .
      out[3] |= (in[1] << 16) & 0x00FF0000U; // D H . P
    
      out[0] |= (in[2] >> 16) & 0x0000FF00U; // A E I .
      out[1] |= (in[2] >>  8) & 0x0000FF00U; // B F J .
      out[3] |= (in[2] <<  8) & 0x0000FF00U; // D H L P
    
      out[0] |= (in[3] >> 24);               // A E I M
      out[1] |= (in[3] >> 16) & 0x000000FFU; // B F J N
      out[2] |= (in[3] >>  8) & 0x000000FFU; // C G K O
    }
    

    I don't see how it could be answered any other way, since then you'd be depending on a particular compiler compiling it in a particular way, etc.

    Of course if those manipulations themselves can be somehow simplified, it'd help. So that's the only avenue of further pursuit here. Nothing stands out so far, but then it's been a long day for me.

    So far, the cost is 12 shifts, 12 ORs, and 14 ANDs. If the compiler and platform are any good, it can be done in nine 32-bit registers.
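
    One possible simplification, offered here as a sketch rather than as part of the answer's own code, is a two-stage block swap: first transpose each 2x2 block of bytes within the row pairs (0,1) and (2,3), then swap the off-diagonal 2x2 blocks between rows (0,2) and (1,3). The name `transpose_2stage` is made up for illustration, and the same packing is assumed (leftmost byte in the most significant position). Counting operations in this sketch gives 8 shifts, 8 ORs and 12 ANDs; whether that is actually faster on a given target is something only measurement can settle.

```cpp
#include <cstdint>

// Two-stage transpose of a 4x4 byte matrix packed one row per word,
// with the leftmost byte in the most significant position.
// transpose_2stage is a hypothetical name used only for this sketch.
void transpose_2stage(uint32_t const in[4], uint32_t out[4]) {
  // Stage 1: transpose each 2x2 block of bytes within row pairs (0,1) and (2,3).
  uint32_t a = (in[0] & 0xFF00FF00U) | ((in[1] >> 8) & 0x00FF00FFU); // A E C G
  uint32_t b = ((in[0] << 8) & 0xFF00FF00U) | (in[1] & 0x00FF00FFU); // B F D H
  uint32_t c = (in[2] & 0xFF00FF00U) | ((in[3] >> 8) & 0x00FF00FFU); // I M K O
  uint32_t d = ((in[2] << 8) & 0xFF00FF00U) | (in[3] & 0x00FF00FFU); // J N L P

  // Stage 2: swap the off-diagonal 2x2 blocks between row pairs (0,2) and (1,3).
  out[0] = (a & 0xFFFF0000U) | (c >> 16); // A E I M
  out[1] = (b & 0xFFFF0000U) | (d >> 16); // B F J N
  out[2] = (a << 16) | (c & 0x0000FFFFU); // C G K O
  out[3] = (b << 16) | (d & 0x0000FFFFU); // D H L P
}
```

    Because all four intermediate words are computed before anything is written, the sketch also works when `in` and `out` point at the same array.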

    If the compiler is very sad, or the platform doesn't have a barrel shifter, then working on individual bytes could help make it explicit that the shifts and masks are just byte extractions:

    void transpose(uint8_t const in[16], uint8_t out[16]) {
      // A B C D    A E I M
      // E F G H    B F J N
      // I J K L    C G K O
      // M N O P    D H L P
    
      out[0]  = in[0];  // A . . .
      out[1]  = in[4];  // A E . .
      out[2]  = in[8];  // A E I .
      out[3]  = in[12]; // A E I M
      out[4]  = in[1];  // B . . .
      out[5]  = in[5];  // B F . .
      out[6]  = in[9];  // B F J .
      out[7]  = in[13]; // B F J N
      out[8]  = in[2];  // C . . .
      out[9]  = in[6];  // C G . .
      out[10] = in[10]; // C G K .
      out[11] = in[14]; // C G K O
      out[12] = in[3];  // D . . .
      out[13] = in[7];  // D H . .
      out[14] = in[11]; // D H L .
      out[15] = in[15]; // D H L P
    }
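
    To relate the two layouts, here is a small sketch of how a 16-byte row-major matrix maps onto the four packed words used by the first version. The helpers `pack_rows` and `unpack_rows` are hypothetical names, and the packing (leftmost byte in the most significant position) is the one implied by the masks in the word-based code; it does not depend on the host's endianness because it uses shifts rather than pointer casts.

```cpp
#include <cstdint>

// Hypothetical helper: row r of the byte matrix becomes word r,
// with the leftmost byte in the most significant position.
void pack_rows(uint8_t const bytes[16], uint32_t words[4]) {
  for (int r = 0; r < 4; ++r) {
    words[r] = (uint32_t(bytes[4 * r]) << 24) |
               (uint32_t(bytes[4 * r + 1]) << 16) |
               (uint32_t(bytes[4 * r + 2]) << 8) |
               uint32_t(bytes[4 * r + 3]);
  }
}

// Hypothetical helper: the inverse of pack_rows. Each byte is pulled out
// with one shift and one mask, which is all the word version really does.
void unpack_rows(uint32_t const words[4], uint8_t bytes[16]) {
  for (int r = 0; r < 4; ++r) {
    for (int c = 0; c < 4; ++c) {
      bytes[4 * r + c] = uint8_t((words[r] >> (24 - 8 * c)) & 0xFFU);
    }
  }
}
```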
    

    If you really want to shuffle it in-place, then the following would do.

    #include <utility> // std::swap (this version is C++ only)

    void transpose(uint8_t m[16]) {
      std::swap(m[1], m[4]);
      std::swap(m[2], m[8]);
      std::swap(m[3], m[12]);
      std::swap(m[6], m[9]);
      std::swap(m[7], m[13]);
      std::swap(m[11], m[14]);
    }
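
    The six swaps above are just the elements above the diagonal exchanged with their mirrors below it. As a sketch, the same pattern can be written as a loop for any row-major N x N byte matrix; `transpose_in_place` is a made-up name, and for N = 4 it performs exactly the swaps listed above, in the same order.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// In-place transpose of a row-major N x N byte matrix: swap every element
// above the diagonal with its mirror image below the diagonal.
// transpose_in_place is a hypothetical name used only for this sketch.
template <std::size_t N>
void transpose_in_place(uint8_t* m) {
  for (std::size_t r = 0; r < N; ++r) {
    for (std::size_t c = r + 1; c < N; ++c) {
      std::swap(m[r * N + c], m[c * N + r]);
    }
  }
}
```

    It would be called as `transpose_in_place<4>(m)` on a 16-byte array. A compiler may or may not unroll the loops into the six swaps.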
    

    The byte-oriented versions may well produce worse code on modern platforms. Only a benchmark can tell.
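
    A very rough timing harness might look like the following. Everything here is an assumption made for illustration: the name `ns_per_call`, the choice of `std::chrono::steady_clock`, and the use of a `volatile` sink to keep the compiler from discarding the work. Microbenchmarks of code this small are easily distorted by inlining and constant folding, so treat the numbers as a hint and check the generated assembly as well.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical harness: time `iters` calls of a word-based transpose
// function and return the average nanoseconds per call.
template <typename F>
double ns_per_call(F transpose_fn, int iters) {
  uint32_t in[4] = {0x41424344U, 0x45464748U, 0x494A4B4CU, 0x4D4E4F50U};
  uint32_t out[4] = {0, 0, 0, 0};
  volatile uint32_t sink = 0; // keeps the results observable

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    in[0] ^= static_cast<uint32_t>(i); // vary the input between calls
    transpose_fn(in, out);
    sink = sink ^ out[3];
  }
  auto stop = std::chrono::steady_clock::now();

  return std::chrono::duration<double, std::nano>(stop - start).count() / iters;
}
```

    Any function or lambda taking `(uint32_t const*, uint32_t*)` can be passed in, so two candidate versions can be timed side by side with the same iteration count.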
