I was wondering if someone could show me how to use loop tiling/loop blocking effectively for large dense matrix multiplication. I am computing C = AB.
The best results I've gotten are by adding one more for loop that blocks over your N, and by rearranging the loops. I also hoisted loop-invariant code, but the compiler's optimizer should hopefully do this automatically. The block size should be the cache line size divided by sizeof(float). This got it ~50% faster than the transposed approach.
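For reference, a minimal sketch of the transposed approach used as the baseline (my reconstruction, not the benchmarked code; shapes inferred from the blocked code's indexing: A is N x M, B is M x K, C is N x K, all row-major):

#include <stdlib.h>

/* Sketch of the "transposed approach": copy B into a transposed buffer Bt
 * so the inner loop reads both operands sequentially. */
void matrix_mult_transposed(const float *A, const float *B, float *C,
                            const int N, const int M, const int K) {
    float *Bt = malloc(sizeof(float) * (size_t)K * M);
    for (int k = 0; k < M; k++)
        for (int j = 0; j < K; j++)
            Bt[M*j + k] = B[K*k + j];    /* Bt[j][k] = B[k][j] */

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < K; j++) {
            float sum = 0.0f;
            for (int k = 0; k < M; k++)  /* both reads are now unit-stride */
                sum += A[M*i + k] * Bt[M*j + k];
            C[K*i + j] = sum;
        }
    }
    free(Bt);
}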
If you have to pick just one of AVX or blocking, using AVX extensions (vfmadd###ps and haddps) is still substantially faster. Using both is best and straightforward to add, given that you're already testing whether the block size is a multiple of 64 / sizeof(float) == 16 floats == two 256-bit AVX registers.
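For example, here is one way that 16-float inner kernel could look with intrinsics (a minimal sketch; dot16_avx is my name, and the intrinsics compile down to vfmadd###ps/haddps; build with e.g. -mavx -mfma):

#include <immintrin.h>

/* Hypothetical helper: dot product of one full 16-float block, i.e. two
 * 256-bit registers, replacing the scalar k1 loop for full-width blocks. */
static inline float dot16_avx(const float *a, const float *b) {
    __m256 lo = _mm256_fmadd_ps(_mm256_loadu_ps(a),
                                _mm256_loadu_ps(b), _mm256_setzero_ps());
    __m256 hi = _mm256_fmadd_ps(_mm256_loadu_ps(a + 8),
                                _mm256_loadu_ps(b + 8), _mm256_setzero_ps());
    __m256 sum = _mm256_add_ps(lo, hi);
    /* Horizontal reduction: fold 8 lanes down to one float via haddps. */
    __m128 half = _mm_add_ps(_mm256_castps256_ps128(sum),
                             _mm256_extractf128_ps(sum, 1));
    half = _mm_hadd_ps(half, half);
    half = _mm_hadd_ps(half, half);
    return _mm_cvtss_f32(half);
}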
Tiling:
void matrix_mult_wiki_block(const float *A, const float *B, float *C,
                            const int N, const int M, const int K) {
    const int block_size = 64 / sizeof(float); // 64 = common cache line size
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < K; j++) {
            C[K*i + j] = 0;
        }
    }
    for (int i0 = 0; i0 < N; i0 += block_size) {
        int imax = i0 + block_size > N ? N : i0 + block_size;
        for (int j0 = 0; j0 < M; j0 += block_size) {
            int jmax = j0 + block_size > M ? M : j0 + block_size;
            for (int k0 = 0; k0 < K; k0 += block_size) {
                int kmax = k0 + block_size > K ? K : k0 + block_size;
                for (int j1 = j0; j1 < jmax; ++j1) {
                    int sj = M * j1;
                    for (int i1 = i0; i1 < imax; ++i1) {
                        int mi = M * i1;   // loop invariants hoisted by hand
                        int ki = K * i1;
                        int kij = ki + j1;
                        for (int k1 = k0; k1 < kmax; ++k1) {
                            // B is expected in the transposed (Bt) layout:
                            // row j1 of B is contiguous in k1.
                            C[kij] += A[mi + k1] * B[sj + k1];
                        }
                    }
                }
            }
        }
    }
}
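Combining the two is then mostly a matter of dispatching on the block width: the innermost k1 loop can call the 16-float kernel whenever the block is full width and fall back to scalar code on the ragged edge. A hedged sketch, dropped in place of the k1 loop above (dot16_avx is the hypothetical helper sketched earlier):

if (kmax - k0 == 16) {       /* full block: two 256-bit AVX registers */
    C[kij] += dot16_avx(&A[mi + k0], &B[sj + k0]);
} else {                     /* ragged edge: scalar fallback */
    for (int k1 = k0; k1 < kmax; ++k1) {
        C[kij] += A[mi + k1] * B[sj + k1];
    }
}

Hoisting that test out of the i1/j1 loops would avoid the per-iteration branch, since kmax - k0 depends only on k0.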
As for the reference to Cannon's algorithm, SUMMA is a better one to follow.
In case anyone else is optimizing tall-skinny multiplications ({~1e9 x 50} x {50 x 50}, which is how I ended up here): the transposed approach is nearly identical in performance to the blocked approach up to n=18 (floats). n=18 is a pathological case (way worse than 17 or 19), and I don't quite see the cache access patterns that cause this. All larger n are improved by the blocked approach.