I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of numbe
An efficient solution is possible on a 64-bit machine, if you accept that. First, shift the four 32-bit row integers by 0, 1, 2 and 3 bytes respectively [3 shifts], so that the rows sit staggered inside 64-bit words. Then mask out the unwanted bits and combine the rows with logical ORs [12 ANDs with a constant, 12 ORs]. Finally, shift each result back into the low 32 bits [3 shifts] and read out the 32-bit value.
ABCD
EFGH
IJKL
MNOP

ABCD
 EFGH
  IJKL
   MNOP

A---
 E---
  I---
   MNOP
=======
AEIMNOP
AEIM

AB--
 -F--
  -J--
   -NOP
=======
ABFJNOP
 BFJN

ABC-
 --G-
  --K-
   --OP
=======
ABCGKOP
  CGKO

ABCD
 ---H
  ---L
   ---P
=======
ABCDHLP
   DHLP
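
The steps above could be sketched in C roughly as follows. This is only an illustration under some assumptions that are not in the answer itself: the function name `transpose4x4_bytes` is made up, each row is assumed to be stored in a `uint32_t` with its first byte in the most significant position (so the row "ABCD" is 0x41424344), and every row is masked down to a single byte per output, whereas the diagrams let the first and last rows keep a few extra bytes that are discarded by the final shift anyway.

```c
#include <stdint.h>

/* Hypothetical helper: transpose a 4x4 byte block given as four 32-bit rows.
   Assumed convention: the first byte of a row is the most significant byte. */
void transpose4x4_bytes(const uint32_t in[4], uint32_t out[4])
{
    /* Stagger the rows by one byte each inside 64-bit words
       (row 0 ends up highest, row 3 stays in the low 32 bits). */
    const uint64_t r0 = (uint64_t)in[0] << 24;
    const uint64_t r1 = (uint64_t)in[1] << 16;
    const uint64_t r2 = (uint64_t)in[2] << 8;
    const uint64_t r3 = (uint64_t)in[3];

    for (int k = 0; k < 4; ++k) {
        /* Bit offset, within the staggered layout, of byte k of row 0.
           Byte k of rows 1..3 sits 8, 16 and 24 bits below it. */
        const int top = (6 - k) * 8;
        const uint64_t byte = 0xFFull;

        /* Keep one byte per row and merge them into four adjacent bytes. */
        const uint64_t merged = (r0 & (byte << top))
                              | (r1 & (byte << (top - 8)))
                              | (r2 & (byte << (top - 16)))
                              | (r3 & (byte << (top - 24)));

        /* Shift the 4-byte window down to the low 32 bits and truncate. */
        out[k] = (uint32_t)(merged >> (top - 24));
    }
}
```

With the rows 0x41424344, 0x45464748, 0x494A4B4C and 0x4D4E4F50 (the ASCII codes of "ABCD", "EFGH", "IJKL" and "MNOP"), the outputs are 0x4145494D, 0x42464A4E, 0x43474B4F and 0x44484C50, i.e. "AEIM", "BFJN", "CGKO" and "DHLP". The loop has a fixed trip count and constant masks, so a compiler is free to unroll it into straight-line shifts, ANDs and ORs.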