loop tiling/blocking for large dense matrix multiplication

野性不改 2020-12-30 03:11

I was wondering if someone could show me how to use loop tiling/loop blocking for large dense matrix multiplication effectively. I am doing C = AB.

1 Answer
  •  执笔经年
    2020-12-30 03:32

    The best results I've gotten are by adding one more for loop that blocks over your N, and by rearranging the loops. I also hoisted loop-invariant code, but the compiler's optimizer should hopefully do this automatically. The block size should be the cache line size divided by sizeof(float). This got it ~50% faster than the transposed approach.
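    For context, the "transposed approach" those timings compare against would look roughly like this (a minimal sketch of my own reconstruction, not the poster's code, with standard row-major strides assumed): transpose B once up front so the inner loop streams both operands sequentially.

    /* Hedged sketch of a transposed-B baseline (reconstruction, not the
       original answer's code). Assumes A is N x K, Bt is B transposed
       (row j of Bt holds column j of B, length K), C is N x M; row-major. */
    void matrix_mult_transposed(const float *A, const float *Bt, float *C,
                                const int N, const int M, const int K) {
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < M; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < K; ++k)
                    sum += A[i * K + k] * Bt[j * K + k]; // both streams contiguous
                C[i * M + j] = sum;
            }
        }
    }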

    If you have to pick just one of AVX or blocking, using AVX extensions (vfmadd###ps and haddps) is still substantially faster. Using both is best and straightforward to add, given that you're already testing whether the block size is a multiple of 64 / sizeof(float) == 16 floats == two 256-bit AVX registers (see the sketch after the tiled code below).

    • Transposed: 1,816,522 ticks
    • Tiling: 892,431 (51% faster)
    • AVX+tiling: 230,512 (87% faster)

    Tiling:

    void matrix_mult_wiki_block(const float *A, const float *B, float *C,
                                const int N, const int M, const int K) {
        const int block_size = 64 / sizeof(float); // 64 = common cache line size
        for (int i0 = 0; i0 < N; i0 += block_size) {
            int imax = i0 + block_size > N ? N : i0 + block_size;

            for (int j0 = 0; j0 < M; j0 += block_size) {
                int jmax = j0 + block_size > M ? M : j0 + block_size;

                for (int k0 = 0; k0 < K; k0 += block_size) {
                    int kmax = k0 + block_size > K ? K : k0 + block_size;

                    for (int j1 = j0; j1 < jmax; ++j1) {
                        int sj = M * j1; // row j1 of B (B stored transposed), hoisted

                        for (int i1 = i0; i1 < imax; ++i1) {
                            // loop-invariant index arithmetic, hoisted out of the k1 loop
                            int mi = M * i1;
                            int ki = K * i1;
                            int kij = ki + j1;

                            for (int k1 = k0; k1 < kmax; ++k1) {
                                C[kij] += A[mi + k1] * B[sj + k1];
                            }
                        }
                    }
                }
            }
        }
    }
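
    To give a sense of the AVX variant mentioned above, here is a minimal sketch of what the inner k1 loop could become with FMA. This is my assumption of the approach, not the poster's exact code; dot_block_avx is a hypothetical helper name. It assumes kmax - k0 is a multiple of 16 floats (two 256-bit registers, matching the block-size test mentioned earlier) and compilation with -mavx2 -mfma.

    #include <immintrin.h>

    /* Hypothetical AVX2/FMA kernel (sketch, not the original answer's code):
       accumulates a dot product over one block row, 16 floats per iteration. */
    static inline float dot_block_avx(const float *a, const float *b, int len) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        for (int k = 0; k < len; k += 16) {            // 16 floats = 2 x 256-bit
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + k),
                                   _mm256_loadu_ps(b + k), acc0);      // vfmadd###ps
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + k + 8),
                                   _mm256_loadu_ps(b + k + 8), acc1);
        }
        // horizontal sum of the two accumulators
        __m256 acc = _mm256_add_ps(acc0, acc1);
        __m128 lo  = _mm256_castps256_ps128(acc);
        __m128 hi  = _mm256_extractf128_ps(acc, 1);
        __m128 s   = _mm_add_ps(lo, hi);
        s = _mm_hadd_ps(s, s);                         // haddps
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }

    The innermost loop of the blocked code would then collapse to something like C[kij] += dot_block_avx(&A[mi + k0], &B[sj + k0], kmax - k0);.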
    

    As for the reference to Cannon's algorithm, the SUMMA algorithm is a better one to follow.


    In case anyone else is optimizing tall-skinny multiplications ({~1e9 x 50} x {50 x 50}, which is how I ended up here): the transposed approach is nearly identical in performance to the blocked approach up to n=18 (floats). n=18 is a pathological case (way worse than 17 or 19), and I don't quite see the cache access patterns that cause this. All larger n are improved with the blocked approach.
