I was wondering if someone could show me how to use loop tiling/loop blocking effectively for large dense matrix multiplication. I am computing C = AB.
The best results I've gotten are by adding one more for loop that blocks over your N, and by rearranging the loops. I also hoisted loop-invariant code, but the compiler's optimizer should hopefully do this automatically. The block size should be the cache line size divided by sizeof(float). This got it ~50% faster than the transposed approach.
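For reference, a minimal sketch of the transposed approach used as the baseline (my reconstruction, not the benchmarked code; shapes inferred from the blocked code's indexing: A is N x M, B is M x K, C is N x K, all row-major):

#include <stdlib.h>

/* Sketch of the "transposed approach": copy B into a transposed buffer Bt
 * so the inner loop reads both operands sequentially. */
void matrix_mult_transposed(const float *A, const float *B, float *C,
                            const int N, const int M, const int K) {
    float *Bt = malloc(sizeof(float) * (size_t)K * M);
    for (int k = 0; k < M; k++)
        for (int j = 0; j < K; j++)
            Bt[M*j + k] = B[K*k + j];    /* Bt[j][k] = B[k][j] */

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < K; j++) {
            float sum = 0.0f;
            for (int k = 0; k < M; k++)  /* both reads are now unit-stride */
                sum += A[M*i + k] * Bt[M*j + k];
            C[K*i + j] = sum;
        }
    }
    free(Bt);
}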
If you have to pick just one of AVX or blocking, using AVX extensions (vfmadd###ps and haddps) is still substantially faster. Using both is best and straightforward to add, given that you're already testing whether the block size is a multiple of 64 / sizeof(float) == 16 floats == two 256-bit AVX registers.
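For example, here is one way that 16-float inner kernel could look with intrinsics (a minimal sketch; dot16_avx is my name, and the intrinsics compile down to vfmadd###ps/haddps; build with e.g. -mavx -mfma):

#include <immintrin.h>

/* Hypothetical helper: dot product of one full 16-float block, i.e. two
 * 256-bit registers, replacing the scalar k1 loop for full-width blocks. */
static inline float dot16_avx(const float *a, const float *b) {
    __m256 lo = _mm256_fmadd_ps(_mm256_loadu_ps(a),
                                _mm256_loadu_ps(b), _mm256_setzero_ps());
    __m256 hi = _mm256_fmadd_ps(_mm256_loadu_ps(a + 8),
                                _mm256_loadu_ps(b + 8), _mm256_setzero_ps());
    __m256 sum = _mm256_add_ps(lo, hi);
    /* Horizontal reduction: fold 8 lanes down to one float via haddps. */
    __m128 half = _mm_add_ps(_mm256_castps256_ps128(sum),
                             _mm256_extractf128_ps(sum, 1));
    half = _mm_hadd_ps(half, half);
    half = _mm_hadd_ps(half, half);
    return _mm_cvtss_f32(half);
}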
Tiling:
void matrix_mult_wiki_block(const float *A, const float *B, float *C,
                            const int N, const int M, const int K) {
    const int block_size = 64 / sizeof(float); // 64 = common cache line size
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < K; j++) {
            C[K*i + j] = 0;
        }
    }
    for (int i0 = 0; i0 < N; i0 += block_size) {
        int imax = i0 + block_size > N ? N : i0 + block_size;
        for (int j0 = 0; j0 < M; j0 += block_size) {
            int jmax = j0 + block_size > M ? M : j0 + block_size;
            for (int k0 = 0; k0 < K; k0 += block_size) {
                int kmax = k0 + block_size > K ? K : k0 + block_size;
                for (int j1 = j0; j1 < jmax; ++j1) {
                    int sj = M * j1;
                    for (int i1 = i0; i1 < imax; ++i1) {
                        int mi = M * i1;   // loop invariants hoisted by hand
                        int ki = K * i1;
                        int kij = ki + j1;
                        for (int k1 = k0; k1 < kmax; ++k1) {
                            // B is expected in the transposed (Bt) layout:
                            // row j1 of B is contiguous in k1.
                            C[kij] += A[mi + k1] * B[sj + k1];
                        }
                    }
                }
            }
        }
    }
}
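Combining the two is then mostly a matter of dispatching on the block width: the innermost k1 loop can call the 16-float kernel whenever the block is full width and fall back to scalar code on the ragged edge. A hedged sketch, dropped in place of the k1 loop above (dot16_avx is the hypothetical helper sketched earlier):

if (kmax - k0 == 16) {       /* full block: two 256-bit AVX registers */
    C[kij] += dot16_avx(&A[mi + k0], &B[sj + k0]);
} else {                     /* ragged edge: scalar fallback */
    for (int k1 = k0; k1 < kmax; ++k1) {
        C[kij] += A[mi + k1] * B[sj + k1];
    }
}

Hoisting that test out of the i1/j1 loops would avoid the per-iteration branch, since kmax - k0 depends only on k0.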
As for the reference to Cannon's algorithm, SUMMA is a better one to follow.
In case anyone else is optimizing tall-skinny multiplications ({~1e9 x 50} x {50 x 50}, which is how I ended up here): the transposed approach is nearly identical in performance to the blocked approach up to n=18 (floats). n=18 is a pathological case (way worse than 17 or 19), and I don't quite see the cache access patterns that cause this. All larger n are improved by the blocked approach.