Non Square Matrix Multiplication in CUDA

。_饼干妹妹 提交于 2019-12-02 02:26:26

I think the easiest thing to do would be to just pad the blocks on the end with zeros:

for(int m=0; m< uWM/blocksize;++m){
    colM = m*blocksize+tx;
    rowN = m*blocksize+ty;
    if (rowM > uWN || rowN > uWM || colM > uWM || colN > uWN) {
        MS[ty][tx]=0.;
        NS[ty][tx]=0.;
    } else {
        MS[ty][tx]=M[rowM*uWM+colM];
        NS[ty][tx]=N[colN + uWN*rowN];
    }

plus or minus. (That NS line should reference N, not M, right?)

But, since I seem to be the only one here advocating using existing tuned libraries when possible -- why not use CUBLAS or MAGMA instead of rolling your own? They're fast, and tested by hundreds of users.

The underlying performance requirement here is that either the first or second dimension of the shared memory "tile" be a round multiple of 16 - historically that is what is necessary to achieve optimal global memory bandwidth (ie. half warp coalesced transactions). Whether it should be the first or second dimension of the tile is dictated by whether the matrices are stored in column or row major order. There is nothing to say that the shared memory tile need be square, only that the leading dimension of the storage (LDA in BLAS notation) be round multiples of 16.

You could easily template the kernel with the tile dimensions as template arguments and instantiate several versions, depending on matrix dimensions. For a given architecture, there is probably an optimal tile dimension which balances occupancy and instruction level parallelism. The "clever" way to solve this is probably to decompose the matrix multiplication into two operations - the first doing the bulk of the work at the optimal tile size, and the second at a different size for the remaining columns. If the result is going straight back to host memory after the product is completed, the second operation might best be done on the host using an optimised BLAS, overlapped with the GPU kernel. This is the approach that many of the routines in the UTK Magma library use.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!