Large matrix multiplication on GPU


Question


I need to implement matrix multiplication on a GPU with CUDA for large matrices. Each matrix alone is larger than the GPU memory, so I think I need an algorithm to do this efficiently. I searched the internet but couldn't find one. Can anyone give me the name of, or a link to, such an algorithm?

Thank you


Answer 1:


There isn't really a formal algorithm for this; in general, these sorts of linear algebra operations where the whole problem isn't stored in memory simultaneously are referred to as "out of core" operations.

To solve it, you don't need a particularly elaborate algorithm, just the CUBLAS library and a pencil and paper. For example, you can decompose the matrix product like this:
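(one possible partitioning, assuming A is split into two row blocks and B into two column blocks)

$$
AB \;=\; \begin{bmatrix} A_1 \\ A_2 \end{bmatrix} \begin{bmatrix} B_1 & B_2 \end{bmatrix} \;=\; \begin{bmatrix} A_1 B_1 & A_1 B_2 \\ A_2 B_1 & A_2 B_2 \end{bmatrix}
$$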

which gives you four independent sub-matrix multiplication operations. These can be calculated using four calls to CUBLAS gemm using very straightforward host code. You can extend the idea to as many sub-matrices as are required to match the problem size and your GPU capacity. The same principle can also be used to implement matrix multiplication problems on multiple GPUs (see this question for an example).
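As a rough illustration (not the answer's original code), a minimal host-side sketch of that idea might look like the following. It assumes single-precision, column-major matrices with M and N even, and omits all error checking; cublasSetMatrix/cublasGetMatrix stage the strided host sub-blocks to and from the device:

```c
/* Hypothetical sketch: computes C = A*B with A (M x K), B (K x N), C (M x N)
 * held in host memory, column-major. A is split into two row blocks A_1, A_2
 * and B into two column blocks B_1, B_2; each block C_ij = A_i * B_j is one
 * cublasSgemm call. Assumes M and N are even; error checking omitted. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void blocked_gemm(const float *A, const float *B, float *C, int M, int N, int K)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const int mb = M / 2;                 /* rows per block of A and C    */
    const int nb = N / 2;                 /* columns per block of B and C */
    const float alpha = 1.0f, beta = 0.0f;

    /* Device buffers hold one block of each operand at a time. */
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, sizeof(float) * (size_t)mb * K);
    cudaMalloc((void **)&dB, sizeof(float) * (size_t)K * nb);
    cudaMalloc((void **)&dC, sizeof(float) * (size_t)mb * nb);

    for (int i = 0; i < 2; ++i) {         /* row block of A / C    */
        for (int j = 0; j < 2; ++j) {     /* column block of B / C */
            /* Stage the (strided) host sub-matrices onto the device. */
            cublasSetMatrix(mb, K, sizeof(float), A + (size_t)i * mb, M, dA, mb);
            cublasSetMatrix(K, nb, sizeof(float), B + (size_t)j * nb * K, K, dB, K);

            /* C_ij = A_i * B_j */
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        mb, nb, K, &alpha, dA, mb, dB, K, &beta, dC, mb);

            /* Copy the finished block back into its place in C. */
            cublasGetMatrix(mb, nb, sizeof(float), dC, mb,
                            C + (size_t)j * nb * M + (size_t)i * mb, M);
        }
    }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}
```

Only one block of each operand is resident on the GPU at a time (roughly half of A, half of B, and a quarter of C); a finer partitioning reduces the device footprint further at the cost of more transfers.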

Alternatively, you can find a working implementation of this precise idea in the Harvard-developed SciGPU-GEMM codebase and in the HPL-CUDA linpack implementation (disclaimer: I am affiliated with the latter codebase).



Source: https://stackoverflow.com/questions/14557067/large-matrix-multiplication-on-gpu
