Fastest way to transpose 4x4 byte matrix

前端未结

关注

 5  2004

I have a 4x4 block of bytes that I\'d like to transpose using general purpose hardware. In other words, for bytes A-P, I\'m looking for the most efficient (in terms of numbe

相关标签:

5条回答

抹茶落季

2020-12-19 06:15
An efficient solution is possible on a 64 bits machine, if you accept that. First shift the 32 bits integer constants by (0,) 1, 2 and 3 bytes respectively [3 shitfs]. Then mask out the unwanted bits and perform logical ORs [12 ANDs with a constant, 12 ORs]. Finally, shift back to 32 bits [3 shifts] and read out the 32 bits.
```
ABCD
EFGH
IJKL
MNOP

ABCD
 EFGH
  IJKL
   MNOP

A---
 E---
  I---
   MNOP
=======
AEIMNOP
AEIM

AB--
 -F--
  -J--
   -NOP
=======
ABFJNOP
BFJN

ABC-
 --G-
  --K-
   --OP
=======
ABCGKOP
CGKO

ABCD
 ---H
  ---L
   ---P
=======
ABCDHLP
DHLP
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

走了就别回头了

2020-12-19 06:23

Not sure about the speed but these are okay.

template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size][Size])
{
    for (int I = 0; I < Size; ++I)
    {
        for (int J = 0; J < I; ++J)
        {
            std::swap(Data[I][J], Data[J][I]);
        }
    }
}

template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size * Size])
{
    for (int I = 0; I < Size; ++I)
    {
        for (int J = 0; J < I; ++J)
        {
            std::swap(Data[I * Size + J], Data[J * Size + I]);
        }
    }
}

0 讨论(0)

走了就别回头了

2020-12-19 06:24

I posted an answer for this same problem awhile back ago for SSE here.

The only things that need to be added are vectorized load/store operations.

This answer is similar to Z boson's answer to this question. Examples for load/store can be seen there. This answer differs because in addition to the SSE3 implementation there is an SSE2 implementation which is guaranteed to run on any x64 processor.

It's worth noting that both of these solutions assume that the entire matrix is row major in memory, but OP's question states that each row could have it's own pointer which implies the array could be fragmented.

0 讨论(0)
发布评论:

提交评论
- 加载中...

北海茫月

2020-12-19 06:28

Let me rephrase your question: you're asking for a C- or C++-only solution that is portable. Then:

void transpose(uint32_t const in[4], uint32_t out[4]) {
  // A B C D    A E I M
  // E F G H    B F J N
  // I J K L    C G K O
  // M N O P    D H L P

  out[0] = in[0] & 0xFF000000U; // A . . .
  out[1] = in[1] & 0x00FF0000U; // . F . .
  out[2] = in[2] & 0x0000FF00U; // . . K .
  out[3] = in[3] & 0x000000FFU; // . . . P

  out[1] |= (in[0] <<  8) & 0xFF000000U; // B F . .
  out[2] |= (in[0] << 16) & 0xFF000000U; // C . K .
  out[3] |= (in[0] << 24);               // D . . P

  out[0] |= (in[1] >>  8) & 0x00FF0000U; // A E . .
  out[2] |= (in[1] <<  8) & 0x00FF0000U; // C G K .
  out[3] |= (in[1] << 16) & 0x00FF0000U; // D H . P

  out[0] |= (in[2] >> 16) & 0x0000FF00U; // A E I .
  out[1] |= (in[2] >>  8) & 0x0000FF00U; // B F J .
  out[3] |= (in[2] <<  8) & 0x0000FF00U; // D H L P

  out[0] |= (in[3] >> 24);               // A E I M
  out[1] |= (in[3] >>  8) & 0x000000FFU; // B F J N
  out[2] |= (in[3] <<  8) & 0x000000FFU; // C G K O
}

I don't see how it could be answered any other way, since then you'd be depending on a particular compiler compiling it in a particular way, etc.

Of course if those manipulations themselves can be somehow simplified, it'd help. So that's the only avenue of further pursuit here. Nothing stands out so far, but then it's been a long day for me.

So far, the cost is 12 shifts, 12 ORs, 16 ANDs. If the compiler and platform are any good, it can be done in 9 32 bit registers.

If the compiler is very sad, or the platform doesn't have a barrel shifter, then some casting could help extol the fact that the shifts and masks are just byte extractions:

void transpose(uint8_t const in[16], uint8_t out[16]) {
  // A B C D    A E I M
  // E F G H    B F J N
  // I J K L    C G K O
  // M N O P    D H L P

  out[0]  = in[0];  // A . . .
  out[1]  = in[4];  // A E . .
  out[2]  = in[8];  // A E I .
  out[3]  = in[12]; // A E I M
  out[4]  = in[1];  // B . . .
  out[5]  = in[5];  // B F . .
  out[6]  = in[9];  // B F J .
  out[7]  = in[13]; // B F J N
  out[8]  = in[2];  // C . . .
  out[9]  = in[6];  // C G . .
  out[10] = in[10]; // C G K .
  out[11] = in[14]; // C G K O
  out[12] = in[3];  // D . . .
  out[13] = in[7];  // D H . .
  out[14] = in[11]; // D H L .
  out[15] = in[15]; // D H L P
}

If you really want to shuffle it in-place, then the following would do.

void transpose(uint8_t m[16]) {
  std::swap(m[1], m[4]);
  std::swap(m[2], m[8]);
  std::swap(m[3], m[12]);
  std::swap(m[6], m[9]);
  std::swap(m[7], m[13]);
  std::swap(m[11], m[14]);
}

The byte-oriented versions may well produce worse code on modern platforms. Only a benchmark can tell.

0 讨论(0)

眼角桃花

2020-12-19 06:32
You want potability and efficiency. Well you can't have it both ways. You said you want to do this with the fewest number of instructions. Well it's possible to do this with only one instruction with SSE3 using the pshufb instruction (see below) from the x86 instruction set.

Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.

The SSE equivalent of _MM_TRANSPOSE4_PS for bytes is to use _mm_shuffle_epi8 (the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.
```
//use -msse3 with GCC or /arch:SSE2 with MSVC
#include <stdio.h>
#include <tmmintrin.h> //SSSE3
int main() {
    char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,15,16};
    __m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);

    __m128i v = _mm_loadu_si128((__m128i*)x);
    v = _mm_shuffle_epi8(v,mask);
    _mm_storeu_si128((__m128i*)x,v);
    for(int i=0; i<16; i++) printf("%d ", x[i]); printf("\n");
    //output: 0 4 8 12 1 5 9 13 2 6 10 15 3 7 11 16   
}
```
0 讨论(0)
发布评论:

提交评论
- 加载中...