CUDA Matrix multiplication breaks for large matrices


Controlling the WDDM Timeout

The problem is actually the kernel, not the cudaMemcpy(). When you launch the kernel, the GPU goes off and does the work asynchronously with respect to the CPU, so it is only when you synchronize with the GPU that you have to wait for the work to finish. cudaMemcpy() involves an implicit synchronization, hence that is where you see the problem.

You can double-check this by calling cudaThreadSynchronize() after the kernel; the error will then appear to come from the cudaThreadSynchronize() call instead of the cudaMemcpy().
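As a minimal sketch, reusing your existing launch (cudaDeviceSynchronize() is the current replacement for the now-deprecated cudaThreadSynchronize()):

// Launch is asynchronous; this call returns immediately.
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

// Catches launch-configuration errors (invalid grid/block dimensions, etc.).
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Launch error: %s\n", cudaGetErrorString(err));

// Blocks until the kernel has finished; a WDDM timeout or an out-of-bounds
// access detected during execution is reported here rather than at the
// later cudaMemcpy().
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("Execution error: %s\n", cudaGetErrorString(err));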

After changing the TDR timeout, did you restart your machine? Unfortunately Windows needs to be restarted for changes to the TDR settings to take effect. This Microsoft document has a fairly good description of the full settings available.

Kernel problems

In this case the problem is not actually the WDDM timeout. There are errors in the kernel which you would need to resolve (for example, you should be able to increment i by more than one on each iteration), and checking out the matrixMul sample in the SDK may be useful. Incidentally, I hope this is a learning exercise, since in reality you would be better off (for performance) using CUBLAS to perform matrix multiplication.
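If you do go the CUBLAS route later, here is a hedged sketch of what that looks like for square Width x Width matrices already resident on the device; the wrapper function name is illustrative, and the swapped operand order is the usual trick for feeding row-major data to the column-major cuBLAS API:

#include <cublas_v2.h>

// Illustrative helper: computes Pd = Md * Nd for row-major Width x Width
// matrices on the device. cuBLAS assumes column-major storage, so we ask it
// for Nd * Md, which is exactly Md * Nd when the buffers are row-major.
void multiplyWithCublas(const float* Md, const float* Nd, float* Pd, int Width)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                Width, Width, Width,
                &alpha, Nd, Width, Md, Width,
                &beta, Pd, Width);

    cublasDestroy(handle);
}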

The most critical problem in the code is that you are using shared memory without actually allocating any. In your kernel you have:

//Initialize shared memory
extern __shared__ float sharedArrays[];

But when you launch the kernel you do not specify how much shared memory to allocate for each block:

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

The <<<>>> launch syntax actually takes four arguments, where the third and fourth are optional. The fourth is the stream, which is used to overlap compute and data transfer (and for concurrent kernel execution), but the third argument specifies the amount of dynamic shared memory per block, in bytes. In this case I assume you want to store TileWidth * TileWidth floats in the shared memory, so you would use:

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock, dimBlock.x * dimBlock.x * sizeof(float)>>>(Md, Nd, Pd, Width);
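One point to keep in mind (the names below are illustrative, not taken from your code): the extern declaration gives you a single flat region, so if the kernel stages tiles of both Md and Nd, as the matrixMul sample does, you would size the allocation for both tiles at launch and partition the region yourself:

// Inside the kernel: one flat dynamic allocation, split into two tiles.
extern __shared__ float sharedArrays[];
float* tileM = sharedArrays;                           // first  TileWidth*TileWidth floats
float* tileN = sharedArrays + TileWidth * TileWidth;   // second TileWidth*TileWidth floats

// At launch: size the dynamic shared memory to cover both tiles.
size_t sharedBytes = 2 * TileWidth * TileWidth * sizeof(float);
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock, sharedBytes>>>(Md, Nd, Pd, Width);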

The main problem

As you mention in your comment, the actual problem was that your matrix width was not a multiple of the block width (and height, since it is square), meaning that threads in the last blocks would access memory beyond the end of the array. The code should either handle the non-multiple case or ensure that the width is a multiple of the block size.
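For illustration, a guarded (and deliberately un-tiled) version of the computation, just to show where the bounds check belongs; the row/col names are illustrative, not taken from your code:

// Guard against threads whose (row, col) falls outside the matrix when
// Width is not a multiple of the block size.
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

if (row < Width && col < Width)
{
    float sum = 0.0f;
    for (int k = 0; k < Width; ++k)
        sum += Md[row * Width + k] * Nd[k * Width + col];
    Pd[row * Width + col] = sum;
}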

I should have suggested this earlier, but it is often useful to run cuda-memcheck to check for memory access violations like this.

You have to change the driver timeout settings; this is a Windows feature that prevents faulty drivers from making the system unresponsive. Check the Microsoft page describing how to do that.

You should also check the "timeout" flag setting on your GPU Device. If you have the CUDA SDK installed, I believe the "deviceQuery" app will report this property.
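If you prefer to check it programmatically rather than through deviceQuery, a minimal sketch using the runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // kernelExecTimeoutEnabled corresponds to the "run time limit on kernels"
    // line reported by deviceQuery; it is typically set on display (WDDM) GPUs.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("Run time limit on kernels: %s\n",
           prop.kernelExecTimeoutEnabled ? "Yes" : "No");
    return 0;
}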
