Transpose a 2D array

Question

How do you efficiently transpose a matrix? Are there libraries for this, or what algorithm would you use?

E.g.:

short src[W*H] = {   // W = 3 columns, H = 2 rows, row-major
  1,2,3,
  4,5,6
};
short dest[W*H];


rotate_90_clockwise(dest,src,W,H); //<-- magic in here, no need for in-place

//dest is now:

{
  {4, 1},
  {5, 2},
  {6, 3}
};

(In my specific case the src array is raw image data and the destination is a framebuffer. I'm on an embedded ARM target with a toolchain that doesn't support assembly.)

Answer 1:

There are libraries for this, in some cases. And, notably, there are tricks you can play with vectorized data (e.g., four 32-bit elements in a 128-bit vector, but this also applies to four 8-bit bytes in a 32-bit register) to go faster than individual-element accesses.

For a transpose, the standard idea is that you use "shuffle" instructions, which allow you to create a new data vector out of two existing vectors, in any order. You work with 4x4 blocks of the input array. So, starting out, you have:

v0 = 1 2 3 4
v1 = 5 6 7 8
v2 = 9 A B C
v3 = D E F 0

Then, you apply shuffle instructions to the first two vectors (interleaving their odd elements, A0B0 C0D0 -> ACBD, and interleaving their even elements, 0A0B 0C0D -> ACBD), and to the last two, to create a new set of vectors with each 2x2 block transposed:

1 5 3 7
2 6 4 8
9 D B F
A E C 0

Finally, you apply shuffle instructions to the odd pair and the even pair (combining their first pairs of elements, AB00 CD00 -> ABCD, and their last pairs, 00AB 00CD -> ABCD), to get:

1 5 9 D
2 6 A E
3 7 B F
4 8 C 0

And there, 16 elements transposed in eight instructions!

Now, for 8-bit bytes in 32-bit registers, ARM doesn't have shuffle instructions as such, but you can synthesize what you need with shifts and a SEL (select) instruction. Each shuffle in the second set can be done in one instruction with PKHBT (pack halfword bottom top) or PKHTB (pack halfword top bottom).

Finally, if you're using a large ARM processor with the NEON vector extensions, you can do something like this with 16-element vectors on 16x16 blocks.
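Since the asker's toolchain has no assembly support, here is a minimal sketch of the same two-stage idea in portable C, using only shifts and masks on four 32-bit words that each hold one row of four bytes. The function name `transpose4x4_bytes` and the byte order (byte 0 is the least significant byte) are assumptions for illustration, not something from the answer.

```c
#include <stdint.h>

/* Transpose a 4x4 block of bytes held in four 32-bit words, in place.
   Assumption: byte 0 of each row is the least significant byte. */
static void transpose4x4_bytes(uint32_t r[4])
{
    const uint32_t even = 0x00FF00FFu;  /* selects bytes 0 and 2 */
    const uint32_t odd  = 0xFF00FF00u;  /* selects bytes 1 and 3 */

    /* Stage 1: interleave bytes of each row pair,
       which transposes every 2x2 sub-block. */
    uint32_t t0 = (r[0] & even) | ((r[1] & even) << 8);
    uint32_t t1 = ((r[0] & odd) >> 8) | (r[1] & odd);
    uint32_t t2 = (r[2] & even) | ((r[3] & even) << 8);
    uint32_t t3 = ((r[2] & odd) >> 8) | (r[3] & odd);

    /* Stage 2: combine halfwords of the two pairs
       (the step the answer maps to PKHBT/PKHTB). */
    r[0] = (t0 & 0x0000FFFFu) | (t2 << 16);
    r[1] = (t1 & 0x0000FFFFu) | (t3 << 16);
    r[2] = (t0 >> 16) | (t2 & 0xFFFF0000u);
    r[3] = (t1 >> 16) | (t3 & 0xFFFF0000u);
}
```

Whether the compiler turns stage 2 into pack-halfword instructions depends on the toolchain and target, so treat this as a starting point to profile rather than a guaranteed speedup.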

Answer 2:

One very simple solution that runs in O(1) is to store an additional boolean with the matrix, saying whether it is 'transposed' or not. Every access to the array then goes through this boolean (row/col or col/row).

Of course, accesses in the flipped orientation will stride through memory, which hurts cache utilization.

So if you have many transpose operations and few "complete traversals" (which, by the way, can also be re-ordered according to the value of the boolean), this is probably your best choice.
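A minimal sketch of the flag idea, assuming row-major storage of `short` elements. The type `matrix_view` and the `mv_*` helpers are hypothetical names made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical matrix view: row-major storage plus a "transposed" flag. */
typedef struct {
    short *data;      /* rows * cols elements, row-major */
    size_t rows;      /* stored rows */
    size_t cols;      /* stored columns */
    bool transposed;  /* true: callers see the transpose */
} matrix_view;

/* O(1) transpose: only the flag changes, no element moves. */
static void mv_transpose(matrix_view *m)
{
    m->transposed = !m->transposed;
}

/* Logical dimensions as seen by the caller. */
static size_t mv_rows(const matrix_view *m)
{
    return m->transposed ? m->cols : m->rows;
}

static size_t mv_cols(const matrix_view *m)
{
    return m->transposed ? m->rows : m->cols;
}

/* Element at logical (r, c); the indices are swapped when the flag is set. */
static short mv_get(const matrix_view *m, size_t r, size_t c)
{
    return m->transposed ? m->data[c * m->cols + r]
                         : m->data[r * m->cols + c];
}
```

With the 2x3 example from the question, `mv_get(&m, 0, 1)` returns 4 after `mv_transpose(&m)`, without any element having moved.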

Answer 3:

Wikipedia has an entire article on in-place matrix transposition. For non-square matrices, it's a non-trivial, fairly interesting problem (while using less than O(N x M) memory, that is). The article has links to quite a few papers with algorithms, as well as some source code.

Watch out though - as I said in a comment to your question, your demonstration is not of a standard transposition, which all of the algorithms will be written for.

A standard transposition function would give this result for your example data:

{
  {1, 4},
  {2, 5},
  {3, 6}
};

If you're just doing this to display an image on a screen, you may be best off just doing the transposition as you copy the image to the back buffer, rather than transposing in-place and then blitting.
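To make the difference concrete, here is a sketch of both operations done out of place, i.e. while copying. It assumes `src` is a row-major array of `h` rows by `w` columns and `dest` has room for `w * h` elements. The name `rotate_90_clockwise` and its argument order come from the question; the parameter types and the `transpose` helper are assumptions.

```c
#include <stddef.h>

/* src is h rows by w columns, row-major; dest receives w rows by h columns. */
static void transpose(short *dest, const short *src, size_t w, size_t h)
{
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x)
            dest[x * h + y] = src[y * w + x];
}

/* Same loop, but each destination row is filled right to left,
   which turns the transpose into a 90-degree clockwise rotation. */
static void rotate_90_clockwise(short *dest, const short *src, size_t w, size_t h)
{
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x)
            dest[x * h + (h - 1 - y)] = src[y * w + x];
}
```

For the question's data (w = 3, h = 2), `transpose` yields 1 4 / 2 5 / 3 6 and `rotate_90_clockwise` yields 4 1 / 5 2 / 6 3.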

Answer 4:

If the matrix is square, or if you are not looking for an in-place transposition, it's really easy:

Basically you iterate over the rows and swap every item with the matching column item. You get the matching item by exchanging the row and column indexes. When you've processed all rows, the transposition is finished. You can also go the other way around and iterate over the columns.
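For the square case, the swap loop described above can be sketched as follows. The function name and the `short` element type are assumptions. Only the elements above the diagonal are visited, so each pair is swapped exactly once.

```c
#include <stddef.h>

/* In-place transpose of an n x n row-major matrix. */
static void transpose_square_inplace(short *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Start at j = i + 1: swapping both (i, j) and (j, i)
           would undo the work, and the diagonal stays put. */
        for (size_t j = i + 1; j < n; ++j) {
            short tmp = a[i * n + j];
            a[i * n + j] = a[j * n + i];
            a[j * n + i] = tmp;
        }
    }
}
```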

If you want to increase performance, you can copy a full row into one temporary array and the full matching column into another, then copy them back. It should be slightly faster (even though this strategy involves one more assignment per element) if you use memcpy for the transfers that touch contiguous elements.

If the matrix is not square (as in your example), it's really tricky to do it in place. Since transposing doesn't change the memory requirement, it still looks possible, but if you do it carelessly you will end up overwriting elements of another row or column.

If memory is not a bottleneck, I recommend using a temporary matrix. It's much easier and will probably be faster anyway.

The best method is not to transpose at all, but just to set a flag somewhere stating whether you access the data row-first or column-first. In most cases, algorithms that need a transposition can be rewritten to access a non-transposed matrix as if it were transposed. To achieve this you just have to rewrite some basic operations, like matrix products, to accept matrices in either orientation.

But I understand that in some cases this will not be possible, typically when the data is being prepared for access by some existing hardware or library.

Answer 5:

The most efficient solution here is to rotate the data as it is being copied from RAM to the framebuffer. Rotating the source in RAM and then copying the result to the framebuffer will be, at best, half the speed of the copy-and-rotate version. So the question is whether it is more efficient to read sequentially and write with a stride, or to read with a stride and write sequentially. In code, this would be the choice between:

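// read sequentially, write with a stride
const short *src = image_data;   // W x H pixels, row-major
short *dest = framebuffer;       // H x W pixels, row-major
for (int y = 0; y < H; ++y)
{
   for (int x = 0; x < W; ++x)
   {
     short pixel = *src++;
     dest[x * H + y] = pixel;    // column y, row x of the destination
   }
}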

or:

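// write sequentially, read with a stride
const short *src = image_data;   // W x H pixels, row-major
short *dest = framebuffer;       // H x W pixels, row-major
for (int x = 0; x < W; ++x)
{
   for (int y = 0; y < H; ++y)
   {
     short pixel = src[y * W + x];   // column x, row y of the source
     *dest++ = pixel;
   }
}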

The answer to this can only be determined by profiling the code.

If you have a GPU, it will almost certainly be able to do rotations, and it will be far more efficient to let the GPU rotate the image when blitting it to the screen.

Answer 6:

Just a simple copy to temp and copy-back, transposing as you go, using pointer-stepping to avoid the multiply in address calculation, and the inner loop unrolled:

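// array holds H rows of W chars (row-major); afterwards it holds W rows of H chars.
// memcpy is declared in <string.h>.
char temp[W*H];
char* ptemp = temp;                 // walks the copy sequentially
int i, j;
memcpy(temp, array, sizeof(char)*W*H);
for (i = 0; i < H; i++){
    char* parray = &array[i];       // top of destination column i
    for (j = 0; j+8 <= W; j += 8, ptemp += 8){   // unrolled by 8
        *parray = ptemp[0]; parray += H;
        *parray = ptemp[1]; parray += H;
        *parray = ptemp[2]; parray += H;
        *parray = ptemp[3]; parray += H;
        *parray = ptemp[4]; parray += H;
        *parray = ptemp[5]; parray += H;
        *parray = ptemp[6]; parray += H;
        *parray = ptemp[7]; parray += H;
    }
    for (; j < W; j++, parray += H){   // leftover elements when W is not a multiple of 8
        *parray = *ptemp++;
    }
}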

I don't know how to avoid the cache-locality issue because of the nature of the problem.
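One commonly used mitigation, which this answer does not cover, is to transpose in small square tiles so that both the reads and the writes stay within a few cache lines at a time. The sketch below is an out-of-place version under stated assumptions: the function name `transpose_tiled` and the tile edge `TILE = 8` are made up for illustration, and the best tile size depends on the target's cache, so profile before relying on it.

```c
#include <stddef.h>

enum { TILE = 8 };  /* assumed tile edge; tune for the target's cache */

/* src is h rows by w columns, row-major; dest receives w rows by h columns. */
static void transpose_tiled(char *dest, const char *src, size_t w, size_t h)
{
    for (size_t y0 = 0; y0 < h; y0 += TILE) {
        for (size_t x0 = 0; x0 < w; x0 += TILE) {
            /* Clamp the tile at the right and bottom edges. */
            size_t y1 = (y0 + TILE < h) ? y0 + TILE : h;
            size_t x1 = (x0 + TILE < w) ? x0 + TILE : w;
            for (size_t y = y0; y < y1; ++y)
                for (size_t x = x0; x < x1; ++x)
                    dest[x * h + y] = src[y * w + x];
        }
    }
}
```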

Source: https://stackoverflow.com/questions/1453033/transpose-a-2d-array

Tags

performance

algorithm

embedded

matrix