How would you transpose a binary matrix?

-上瘾入骨i 2020-12-16 02:13

I have binary matrices in C++ that I represent with a vector of 8-bit values.

For example, the following matrix:

1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1         


        
7 answers
  •  情书的邮戳
    2020-12-16 03:06

    I've spent more time looking for a solution, and I've found some good ones.

    The SSE2 way

    On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.

    This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've received on this question so far.

    The idea is simple:

    • #include
    • Pack 16 uint8_t variables into an __m128i
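    • Use _mm_movemask_epi8 to get the MSB of each byte, producing a uint16_t
    • Use _mm_slli_epi64 to shift the register left by one bit (it shifts each 64-bit half separately, which is fine here because only the next bit of every byte has to move into the MSB position)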
    • Repeat until you've got all 8 uint16_ts
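The steps above can be sketched roughly as follows. This is a minimal illustration, not the answer's original code: the function name transpose16x8 is hypothetical, and it assumes that the MSB of each input byte is column 0 and that bit r of each output word should come from row r. It needs an x86 target with SSE2.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

// Transposes a 16x8 bit matrix (hypothetical helper).
// in[r] is row r, with the MSB of the byte taken as column 0 (assumption).
// out[c] is column c; bit r of out[c] is the bit taken from row r.
std::array<std::uint16_t, 8> transpose16x8(const std::uint8_t (&in)[16]) {
    // Pack the 16 row bytes into one 128-bit register (unaligned load).
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    std::array<std::uint16_t, 8> out{};
    for (int c = 0; c < 8; ++c) {
        // Collect the MSB of each of the 16 bytes into the low 16 bits.
        out[c] = static_cast<std::uint16_t>(_mm_movemask_epi8(v));
        // Move the next bit of every byte into the MSB position.
        v = _mm_slli_epi64(v, 1);
    }
    return out;
}
```

If the opposite bit order is needed within each output word, the input bytes can be loaded in reverse order instead.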

    A generic 32-bit solution

    Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but the Cortex-M CPU (unlike the Cortex-A) does not have SIMD capabilities, so NEON isn't much use to me at the moment.

    NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas from the answers that suggest treating an 8×8 block as a uint64_t. Most microcontrollers with a Cortex-M CPU also don't have much memory, so I prefer to do all this without a lookup table.

    After some thinking, I found that the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number to multiply by such that the MSBs of the four bytes end up next to each other in the upper 32 bits of the result.

    • Pack 4 uint8_ts in a 32-bit variable
    • Mask the most significant bit of each byte (using 0x80808080)
    • Multiply it with 0x02040810
    • Take the 4 LSBs of the upper 32 bits of the 64-bit product
    • Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants, so shifting them costs nothing at run time. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as for the first series of 4 bits.
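A minimal sketch of the multiplication trick described in the list above. The function name gather_bits is hypothetical. For N from 0 to 6 it follows the mask-right, magic-left scheme as described; for N = 7 the shifted magic number no longer fits in 32 bits, so this sketch shifts the input instead, which is one way around the problem and may differ from what the answer actually did.

```cpp
#include <cstdint>

// Gathers bit (7 - n) of each of the four bytes packed in x, for n in 0..7.
// Bit k of the result comes from byte k of x (byte 0 is the least significant).
std::uint32_t gather_bits(std::uint32_t x, unsigned n) {
    std::uint32_t mask = 0x80808080u;   // MSB of every byte
    std::uint32_t magic = 0x02040810u;  // lines the four MSBs up at bits 32..35
    if (n < 7) {
        mask >>= n;   // select bit (7 - n) of every byte instead
        magic <<= n;  // compensate so the bits still land at bits 32..35
    } else {
        // magic << 7 would overflow 32 bits, so move the wanted bit up to
        // the MSB position and reuse the unshifted constants (assumption).
        x <<= 7;
    }
    // The 64-bit type is only used to reach the upper half of the product.
    const std::uint64_t product = static_cast<std::uint64_t>(x & mask) * magic;
    return static_cast<std::uint32_t>(product >> 32) & 0xFu;
}
```

The constants work because every pair of set bits in the masked input and the magic number adds up to a distinct bit position, so no carries disturb bits 32 to 35 of the product.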

    Doing this for two 4×8 blocks covers an 8×8 block; the resulting bits can then be arranged so that everything ends up in the right place.
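One possible way to combine two 4×8 blocks into an 8×8 transpose is sketched below. The names pack4, column4 and transpose8x8 are hypothetical, and the bit order (MSB = column 0, row 0 ending up in the MSB of each output byte) is an assumption. For brevity this variant shifts the input left by the column index and always uses the unshifted constants, instead of shifting the mask and the magic number.

```cpp
#include <cstdint>

// Packs four row bytes into one 32-bit word, rows[0] in the most significant
// byte, so that row 0 ends up in the highest bit of the gathered nibble.
static std::uint32_t pack4(const std::uint8_t* rows) {
    return (static_cast<std::uint32_t>(rows[0]) << 24)
         | (static_cast<std::uint32_t>(rows[1]) << 16)
         | (static_cast<std::uint32_t>(rows[2]) << 8)
         |  static_cast<std::uint32_t>(rows[3]);
}

// Collects column c (MSB = column 0) of a packed 4x8 block into 4 bits.
static std::uint32_t column4(std::uint32_t block, unsigned c) {
    // Bring bit (7 - c) of every byte to the MSB position, then mask it.
    const std::uint32_t msbs = (block << c) & 0x80808080u;
    // The multiplication lines the four MSBs up at bits 32..35.
    const std::uint64_t product = static_cast<std::uint64_t>(msbs) * 0x02040810u;
    return static_cast<std::uint32_t>(product >> 32) & 0xFu;
}

// Transposes an 8x8 bit matrix: bit (7 - r) of out[c] is bit (7 - c) of in[r].
void transpose8x8(const std::uint8_t in[8], std::uint8_t out[8]) {
    const std::uint32_t top = pack4(in);         // rows 0..3
    const std::uint32_t bottom = pack4(in + 4);  // rows 4..7
    for (unsigned c = 0; c < 8; ++c) {
        // Rows 0..3 fill the high nibble, rows 4..7 the low nibble.
        out[c] = static_cast<std::uint8_t>((column4(top, c) << 4)
                                           | column4(bottom, c));
    }
}
```

With this layout an identity matrix (in[r] = 0x80 >> r) transposes to itself, which makes a convenient sanity check.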
