Binary matrix multiplication bit twiddling hack

Submitted by 拟墨画扇 on 2019-12-04 05:21:08

I'm not sure about the most efficient approach, but here's something to try. The following sequence of instructions computes the main diagonal of the product A * T'. Rotate both T and D by 8 bits and repeat for 7 more iterations to obtain the remaining diagonals.

// uint64_t A, T;
uint64_t D = UINT64_C(0x8040201008040201);
uint64_t P = A & T;
// test whether each byte is nonzero
P |= P >> 1;
P |= P >> 2;
P |= P >> 4;
P &= UINT64_C(0x0101010101010101);
// fill each nonzero byte with ones
P *= 255;  // or P = (P << 8) - P;
// leave only the current diagonal
P &= D;
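
Putting the rotation step together, the full product can be sketched as follows (a minimal sketch; `mul_AT` and `rotr8` are hypothetical names, and it assumes bytes are rows, with bit j of byte i holding entry (i, j), rotating to the right):

```c
#include <stdint.h>

/* rotate right by 8 bits (one byte) */
static uint64_t rotr8(uint64_t x) { return (x >> 8) | (x << 56); }

/* boolean product A * T', accumulating one diagonal per iteration */
uint64_t mul_AT(uint64_t A, uint64_t T) {
    uint64_t D = UINT64_C(0x8040201008040201);
    uint64_t C = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t P = A & T;
        /* test whether each byte is nonzero */
        P |= P >> 1;
        P |= P >> 2;
        P |= P >> 4;
        P &= UINT64_C(0x0101010101010101);
        P *= 255;        /* fill each nonzero byte with ones */
        C |= P & D;      /* keep only the current diagonal */
        T = rotr8(T);
        D = rotr8(D);
    }
    return C;
}
```

With this layout the identity matrix is 0x8040201008040201, and multiplying it by itself returns it unchanged.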

If you are looking for a way to do dense matrix multiplication in parallel, partition your result matrix into blocks and compute each block in parallel.

http://en.wikipedia.org/wiki/Block_matrix#Block_matrix_multiplication
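
As a minimal sketch of the block idea (`bmul8` and `bmul16` are hypothetical names; the 8x8 kernel here is a naive triple loop, and any of the bit-twiddling kernels on this page could be dropped in instead), a 16x16 boolean product can be assembled from four independent 8x8 result blocks, each of which could be computed on its own thread:

```c
#include <stdint.h>

/* naive 8x8 boolean multiply on row-major bit matrices;
   entry (i, j) lives at bit 8*i + j */
static uint64_t bmul8(uint64_t a, uint64_t b) {
    uint64_t c = 0;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j) {
            uint64_t acc = 0;
            for (int k = 0; k < 8; ++k)
                acc |= ((a >> (8*i + k)) & 1) & ((b >> (8*k + j)) & 1);
            c |= acc << (8*i + j);
        }
    return c;
}

/* 16x16 product from 2x2 blocks: C[I][J] = OR over K of A[I][K] * B[K][J].
   Each C[I][J] is independent of the others, hence trivially parallel. */
void bmul16(uint64_t A[2][2], uint64_t B[2][2], uint64_t C[2][2]) {
    for (int I = 0; I < 2; ++I)
        for (int J = 0; J < 2; ++J)
            C[I][J] = bmul8(A[I][0], B[0][J]) | bmul8(A[I][1], B[1][J]);
}
```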

It is not clear what data structure you are using, which language (yes, I know you said 'any language'), or what you are trying to optimize (speed? memory?). All of these may have a profound impact on your solution.

Some examples:

  • Say this is C/C++, and your matrices are contiguous bits in memory. Each row/column maps to a uint8_t. In this case, multiplying a row with a column reduces to an 8-bit bitwise AND followed by a check that the result is nonzero (no need to sum the bits). This takes two processor instructions.
  • If you are forced to do bit-by-bit operations, use bitwise OR (|) instead of +. Some languages may evaluate this lazily, stopping at the first '1' they encounter.
  • If you can multi-thread, you could speed up the calculations.

BTW, I'm assuming you have a lot of matrices to process, otherwise I would use direct, readable code. My guess is that even with a lot of matrices, the gain in performance would be negligible.

If you allow lower-level constructs than standard C/C++, then SSE/AVX machine instructions together with compiler intrinsics allow you to write much faster code (4x according to a benchmark I made). You need to use a non-standard vector type (supported at least by GCC, ICC, and Clang):

using epu = uint8_t __attribute__((vector_size(16)));

I'm using a class such as

class BMat8 {
    [...]
  private:
    uint64_t _data;
};

Then the following code should do what you want:

static constexpr epu rothigh { 0, 1, 2, 3, 4, 5, 6, 7,15, 8, 9,10,11,12,13,14};
static constexpr epu rot2    { 6, 7, 0, 1, 2, 3, 4, 5,14,15, 8, 9,10,11,12,13};

inline BMat8 operator*(BMat8 const& tr) const {
  epu x = _mm_set_epi64x(_data, _data);
  epu y = _mm_shuffle_epi8(_mm_set_epi64x(tr._data, tr._data), rothigh);
  epu data {};
  epu diag =  {0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80,
               0x80,0x01,0x02,0x04,0x08,0x10,0x20,0x40};
  for (int i = 0; i < 4; ++i) {
    data |= ((x & y) != epu {}) & diag;
    y    = _mm_shuffle_epi8(y, rot2);
    diag = _mm_shuffle_epi8(diag, rot2);
  }
  return BMat8(_mm_extract_epi64(data, 0) | _mm_extract_epi64(data, 1));
}

In particular, using 128-bit registers, I'm able to do two iterations at once.

The solution for strictly boolean algebra can be achieved pretty efficiently on an x86-64 using the solution I described here:

https://stackoverflow.com/a/55307540/11147804

The only difference is that the data from the transposed matrix also needs to be extracted by columns and repacked into rows before each 64-bit product. Fortunately, this is trivial to do using the BMI2 instruction for parallel bit extract, accessible in GCC through the intrinsic _pext_u64:

#include <immintrin.h> // for _pext_u64 (BMI2; compile with -mbmi2)

uint64_t torow (uint64_t c); // forward declaration; defined below

uint64_t mul8x8T (uint64_t A, uint64_t B) {

    const uint64_t COL = 0x0101010101010101;

    uint64_t C = 0;

    for (int i=0; i<8; ++i) {
        uint64_t p = COL & (A>>i); // select column
        uint64_t r = torow( COL & (B>>i) );
        C |= (p*r); // use ^ for GF(2) instead
    }
    return C;
}


uint64_t torow (uint64_t c) {
    const uint64_t ROW = 0x00000000000000FF; // mask of the first row
    const uint64_t COL = 0x0101010101010101; // mask of the first column

    // select bits of c in positions marked by COL,
    // and pack them consecutively
    // last 'and' is included for clarity and is not 
    // really necessary 
    return _pext_u64(c, COL) & ROW;
}

In processors which do not support this particular instruction one possible solution is to adapt the typical bit trick for packing, which is used for example in the classic bit order reversal of a byte using 64-bit multiplication:

https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith64BitsDiv
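
For reference, the linked reversal looks like this: the multiplication spreads copies of the input bits, the mask keeps one copy of each bit in reversed order at positions 10 apart, and the modulus by 2^10 - 1 folds every 10th bit down into the low byte:

```c
#include <stdint.h>

/* classic byte reversal from the Bit Twiddling Hacks page */
uint8_t reverse_byte(uint8_t b) {
    return (uint8_t)((b * UINT64_C(0x0202020202)
                        & UINT64_C(0x010884422010)) % 1023);
}
```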

Using masks and an integer multiplication by a suitable constant produces a quadword containing the packed result as a contiguous bit substring, which can then be extracted with a bit shift and a mask.

The idea is to think of the multiplication step as a parallel bit shift in which every bit of the input is shifted by a different amount, specified by the constant. This is always possible as long as the shifted copies do not collide at any position in the result, i.e. as long as each partial sum of the multiplication updates different bit positions. This avoids any potential carries, which makes the bit-by-bit sum equivalent to a bit-parallel OR (or XOR).

uint64_t torow (uint64_t c) {
    const uint64_t ROW = 0x00000000000000FF; // select 8 lowest consecutive bits to get the first row
    const uint64_t COL = 0x0101010101010101; // select every 8th bit to get the first column
    const uint64_t DIA = 0x8040201008040201; // select every 8+1 bit to obtain a diagonal

    c *= ROW; // "copies" first column to the rest
    c &= DIA; // only use diagonal bits or else there will be position collisions and unexpected carries
    c *= COL; // "scatters" every bit to all rows after itself; the last row will now contain the packed bits
    return c >> 56; // move last row to first & discard the rest
}

There are other possible alternative implementations of this function using more operations of lower strength, the fastest of which will depend on the target architecture.
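
For completeness, here is a self-contained sketch combining the earlier loop with this multiplication-based torow, so it runs without BMI2 support (`mul8x8T_portable` is a hypothetical name; as before, it computes the product with the transpose, A * B'):

```c
#include <stdint.h>

/* pack the first column of c into the first row, using only
   multiplications, masks and a shift (no BMI2 required) */
static uint64_t torow_mul(uint64_t c) {
    const uint64_t ROW = UINT64_C(0xFF);
    const uint64_t COL = UINT64_C(0x0101010101010101);
    const uint64_t DIA = UINT64_C(0x8040201008040201);
    c *= ROW;       /* copy first column across each byte */
    c &= DIA;       /* keep diagonal bits only, avoiding carries */
    c *= COL;       /* accumulate packed bits into the top byte */
    return c >> 56; /* move top byte down, discard the rest */
}

uint64_t mul8x8T_portable(uint64_t A, uint64_t B) {
    const uint64_t COL = UINT64_C(0x0101010101010101);
    uint64_t C = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t p = COL & (A >> i);            /* column i of A */
        uint64_t r = torow_mul(COL & (B >> i)); /* column i of B as a row */
        C |= p * r;                             /* boolean outer product */
    }
    return C;
}
```

Multiplying any matrix by the identity (0x8040201008040201 in this layout) should return it unchanged, which is a convenient sanity check.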
