Binary matrix multiplication bit twiddling hack

Submitted by 拟墨画扇 on 2019-12-04 05:21:08

I'm not sure about the most efficient approach, but here's something to try. The following sequence of instructions computes the main diagonal of the product A * T'. Rotate both T and D by 8 bits and repeat for 7 more iterations to obtain the remaining diagonals.

// uint64_t A, T;
uint64_t D = UINT64_C(0x8040201008040201);
uint64_t P = A & T;
// test whether each byte is nonzero
P |= P >> 1;
P |= P >> 2;
P |= P >> 4;
P &= UINT64_C(0x0101010101010101);
// fill each nonzero byte with ones
P *= 255;  // or P = (P << 8) - P;
// leave only the current diagonal
P &= D;
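
Putting the rotation step together, the full product can be sketched as follows (a minimal sketch; `mul_AT` and `rotr8` are hypothetical names, and it assumes bytes are rows, with bit j of byte i holding entry (i, j), rotating to the right):

```c
#include <stdint.h>

/* rotate right by 8 bits (one byte) */
static uint64_t rotr8(uint64_t x) { return (x >> 8) | (x << 56); }

/* boolean product A * T', accumulating one diagonal per iteration */
uint64_t mul_AT(uint64_t A, uint64_t T) {
    uint64_t D = UINT64_C(0x8040201008040201);
    uint64_t C = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t P = A & T;
        /* test whether each byte is nonzero */
        P |= P >> 1;
        P |= P >> 2;
        P |= P >> 4;
        P &= UINT64_C(0x0101010101010101);
        P *= 255;        /* fill each nonzero byte with ones */
        C |= P & D;      /* keep only the current diagonal */
        T = rotr8(T);
        D = rotr8(D);
    }
    return C;
}
```

With this layout the identity matrix is 0x8040201008040201, and multiplying it by itself returns it unchanged.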

If you are looking for a way to do dense matrix multiplication in parallel, partition your result matrix into blocks and compute each block in parallel.

http://en.wikipedia.org/wiki/Block_matrix#Block_matrix_multiplication
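
As a minimal sketch of the block idea (`bmul8` and `bmul16` are hypothetical names; the 8x8 kernel here is a naive triple loop, and any of the bit-twiddling kernels on this page could be dropped in instead), a 16x16 boolean product can be assembled from four independent 8x8 result blocks, each of which could be computed on its own thread:

```c
#include <stdint.h>

/* naive 8x8 boolean multiply on row-major bit matrices;
   entry (i, j) lives at bit 8*i + j */
static uint64_t bmul8(uint64_t a, uint64_t b) {
    uint64_t c = 0;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j) {
            uint64_t acc = 0;
            for (int k = 0; k < 8; ++k)
                acc |= ((a >> (8*i + k)) & 1) & ((b >> (8*k + j)) & 1);
            c |= acc << (8*i + j);
        }
    return c;
}

/* 16x16 product from 2x2 blocks: C[I][J] = OR over K of A[I][K] * B[K][J].
   Each C[I][J] is independent of the others, hence trivially parallel. */
void bmul16(uint64_t A[2][2], uint64_t B[2][2], uint64_t C[2][2]) {
    for (int I = 0; I < 2; ++I)
        for (int J = 0; J < 2; ++J)
            C[I][J] = bmul8(A[I][0], B[0][J]) | bmul8(A[I][1], B[1][J]);
}
```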

It is not clear what data structure you are using, which language (yes, I know you said 'any language'), or what you are trying to optimize (speed? memory?). All of these may have a profound impact on your solution.

Some examples:

  • Say this is C/C++, and your matrices are contiguous bits in memory. Each row/column maps to a uint8_t. In this case, multiplying a row with a column reduces to an 8-bit bitwise AND followed by a check that the result is nonzero (no need to sum the bits). This takes two processor instructions.
  • If you are forced to do bit-by-bit operations, use bitwise OR (|) instead of +. Some languages may evaluate this lazily, stopping at the first '1' they encounter.
  • If you can multi-thread, you could speed up the calculations.

BTW, I'm assuming you have a lot of matrices to process, otherwise I would use direct, readable code. My guess is that even with a lot of matrices, the gain in performance would be negligible.

If you allow lower-level constructs than standard C/C++, then SSE/AVX machine instructions together with compiler intrinsics allow you to write much faster code (4x according to a benchmark I made). You need to use a non-standard vector type (supported at least by GCC, ICC, and Clang):

using epu = uint8_t __attribute__((vector_size(16)));

I'm using a class such as

class BMat8 {
    [...]
  private:
    uint64_t _data;
};

Then the following code should do what you want:

static constexpr epu rothigh { 0, 1, 2, 3, 4, 5, 6, 7,15, 8, 9,10,11,12,13,14};
static constexpr epu rot2    { 6, 7, 0, 1, 2, 3, 4, 5,14,15, 8, 9,10,11,12,13};

inline BMat8 operator*(BMat8 const& tr) const {
  epu x = _mm_set_epi64x(_data, _data);
  epu y = _mm_shuffle_epi8(_mm_set_epi64x(tr._data, tr._data), rothigh);
  epu data {};
  epu diag =  {0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80,
               0x80,0x01,0x02,0x04,0x08,0x10,0x20,0x40};
  for (int i = 0; i < 4; ++i) {
    data |= ((x & y) != epu {}) & diag;
    y    = _mm_shuffle_epi8(y, rot2);
    diag = _mm_shuffle_epi8(diag, rot2);
  }
  return BMat8(_mm_extract_epi64(data, 0) | _mm_extract_epi64(data, 1));
}

In particular, using 128-bit registers, I'm able to do two iterations at once.

The solution for strictly boolean algebra can be achieved pretty efficiently on an x86-64 using the solution I described here:

https://stackoverflow.com/a/55307540/11147804

The only difference is that the data from the transposed matrix also needs to be extracted by columns and repacked into rows before each 64-bit product. Fortunately, this is trivial to do using the BMI2 instruction for parallel bit extract, accessible in GCC through the intrinsic _pext_u64:

#include <immintrin.h> // for _pext_u64 (BMI2; compile with -mbmi2)

uint64_t torow (uint64_t c); // forward declaration; defined below

uint64_t mul8x8T (uint64_t A, uint64_t B) {

    const uint64_t COL = 0x0101010101010101;

    uint64_t C = 0;

    for (int i=0; i<8; ++i) {
        uint64_t p = COL & (A>>i); // select column
        uint64_t r = torow( COL & (B>>i) );
        C |= (p*r); // use ^ for GF(2) instead
    }
    return C;
}


uint64_t torow (uint64_t c) {
    const uint64_t ROW = 0x00000000000000FF; // mask of the first row
    const uint64_t COL = 0x0101010101010101; // mask of the first column

    // select bits of c in positions marked by COL,
    // and pack them consecutively
    // last 'and' is included for clarity and is not 
    // really necessary 
    return _pext_u64(c, COL) & ROW;
}

In processors which do not support this particular instruction one possible solution is to adapt the typical bit trick for packing, which is used for example in the classic bit order reversal of a byte using 64-bit multiplication:

https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith64BitsDiv
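
For reference, the linked reversal looks like this: the multiplication spreads copies of the input bits, the mask keeps one copy of each bit in reversed order at positions 10 apart, and the modulus by 2^10 - 1 folds every 10th bit down into the low byte:

```c
#include <stdint.h>

/* classic byte reversal from the Bit Twiddling Hacks page */
uint8_t reverse_byte(uint8_t b) {
    return (uint8_t)((b * UINT64_C(0x0202020202)
                        & UINT64_C(0x010884422010)) % 1023);
}
```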

Using masks and an integer multiplication by a suitable constant produces a quadword containing the packed result as a contiguous bit substring, which can then be extracted with a bit shift and a mask.

The idea is to think of the multiplication step as a parallel bit shift in which every bit of the input is shifted by a different amount, specified by the constant. This is always possible as long as the shifted copies do not collide at any position in the result, i.e. as long as each partial sum of the multiplication updates different bit positions. This avoids any potential carries, which makes the bit-by-bit sum equivalent to a bit-parallel OR (or XOR).

uint64_t torow (uint64_t c) {
    const uint64_t ROW = 0x00000000000000FF; // select 8 lowest consecutive bits to get the first row
    const uint64_t COL = 0x0101010101010101; // select every 8th bit to get the first column
    const uint64_t DIA = 0x8040201008040201; // select every 8+1 bit to obtain a diagonal

    c *= ROW; // "copies" first column to the rest
    c &= DIA; // only use diagonal bits or else there will be position collisions and unexpected carries
    c *= COL; // "scatters" every bit to all rows after itself; the last row will now contain the packed bits
    return c >> 56; // move last row to first & discard the rest
}

There are other possible alternative implementations of this function using more operations of lower strength, the fastest of which will depend on the target architecture.
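
For completeness, here is a self-contained sketch combining the earlier loop with this multiplication-based torow, so it runs without BMI2 support (`mul8x8T_portable` is a hypothetical name; as before, it computes the product with the transpose, A * B'):

```c
#include <stdint.h>

/* pack the first column of c into the first row, using only
   multiplications, masks and a shift (no BMI2 required) */
static uint64_t torow_mul(uint64_t c) {
    const uint64_t ROW = UINT64_C(0xFF);
    const uint64_t COL = UINT64_C(0x0101010101010101);
    const uint64_t DIA = UINT64_C(0x8040201008040201);
    c *= ROW;       /* copy first column across each byte */
    c &= DIA;       /* keep diagonal bits only, avoiding carries */
    c *= COL;       /* accumulate packed bits into the top byte */
    return c >> 56; /* move top byte down, discard the rest */
}

uint64_t mul8x8T_portable(uint64_t A, uint64_t B) {
    const uint64_t COL = UINT64_C(0x0101010101010101);
    uint64_t C = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t p = COL & (A >> i);            /* column i of A */
        uint64_t r = torow_mul(COL & (B >> i)); /* column i of B as a row */
        C |= p * r;                             /* boolean outer product */
    }
    return C;
}
```

Multiplying any matrix by the identity (0x8040201008040201 in this layout) should return it unchanged, which is a convenient sanity check.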
