CUDA Matrix multiplication breaks for large matrices


Controlling the WDDM Timeout

The problem is actually the kernel, not the cudaMemcpy(). When you launch the kernel, the GPU goes off and does the work asynchronously with respect to the CPU, so it is only when you synchronize with the GPU that you have to wait for the work to finish. cudaMemcpy() involves an implicit synchronization, hence that is where you see the problem.

You can double-check this by calling cudaThreadSynchronize() after the kernel; the error will then appear to come from the cudaThreadSynchronize() call instead of the cudaMemcpy().
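As a minimal sketch, reusing your existing launch (cudaDeviceSynchronize() is the current replacement for the now-deprecated cudaThreadSynchronize()):

// Launch is asynchronous; this call returns immediately.
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

// Catches launch-configuration errors (invalid grid/block dimensions, etc.).
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Launch error: %s\n", cudaGetErrorString(err));

// Blocks until the kernel has finished; a WDDM timeout or an out-of-bounds
// access detected during execution is reported here rather than at the
// later cudaMemcpy().
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("Execution error: %s\n", cudaGetErrorString(err));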

After changing the TDR timeout, did you restart your machine? Unfortunately Windows needs to be restarted for changes to the TDR settings to take effect. This Microsoft document has a fairly good description of the full settings available.

Kernel problems

In this case the problem is not actually the WDDM timeout. There are errors in the kernel which you would need to resolve (for example, you should be able to increment i by more than one on each iteration), and checking out the matrixMul sample in the SDK may be useful. Incidentally, I hope this is a learning exercise, since in reality you would be better off (for performance) using CUBLAS to perform matrix multiplication.
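If you do go the CUBLAS route later, here is a hedged sketch of what that looks like for square Width x Width matrices already resident on the device; the wrapper function name is illustrative, and the swapped operand order is the usual trick for feeding row-major data to the column-major cuBLAS API:

#include <cublas_v2.h>

// Illustrative helper: computes Pd = Md * Nd for row-major Width x Width
// matrices on the device. cuBLAS assumes column-major storage, so we ask it
// for Nd * Md, which is exactly Md * Nd when the buffers are row-major.
void multiplyWithCublas(const float* Md, const float* Nd, float* Pd, int Width)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                Width, Width, Width,
                &alpha, Nd, Width, Md, Width,
                &beta, Pd, Width);

    cublasDestroy(handle);
}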

The most critical problem in the code is that you are using shared memory without actually allocating any. In your kernel you have:

//Initialize shared memory
extern __shared__ float sharedArrays[];

But when you launch the kernel you do not specify how much shared memory to allocate for each block:

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

The <<<>>> launch syntax actually takes four arguments, where the third and fourth are optional. The fourth is the stream, which is used to overlap compute and data transfer (and for concurrent kernel execution), but the third argument specifies the amount of dynamic shared memory per block, in bytes. In this case I assume you want to store TileWidth * TileWidth floats in the shared memory, so you would use:

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock, dimBlock.x * dimBlock.x * sizeof(float)>>>(Md, Nd, Pd, Width);
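One point to keep in mind (the names below are illustrative, not taken from your code): the extern declaration gives you a single flat region, so if the kernel stages tiles of both Md and Nd, as the matrixMul sample does, you would size the allocation for both tiles at launch and partition the region yourself:

// Inside the kernel: one flat dynamic allocation, split into two tiles.
extern __shared__ float sharedArrays[];
float* tileM = sharedArrays;                           // first  TileWidth*TileWidth floats
float* tileN = sharedArrays + TileWidth * TileWidth;   // second TileWidth*TileWidth floats

// At launch: size the dynamic shared memory to cover both tiles.
size_t sharedBytes = 2 * TileWidth * TileWidth * sizeof(float);
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock, sharedBytes>>>(Md, Nd, Pd, Width);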

The main problem

As you mention in your comment, the actual problem was that your matrix width was not a multiple of the block width (and height, since it is square), meaning that threads in the last blocks would access memory beyond the end of the array. The code should either handle the non-multiple case or ensure that the width is a multiple of the block size.
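For illustration, a guarded (and deliberately un-tiled) version of the computation, just to show where the bounds check belongs; the row/col names are illustrative, not taken from your code:

// Guard against threads whose (row, col) falls outside the matrix when
// Width is not a multiple of the block size.
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

if (row < Width && col < Width)
{
    float sum = 0.0f;
    for (int k = 0; k < Width; ++k)
        sum += Md[row * Width + k] * Nd[k * Width + col];
    Pd[row * Width + col] = sum;
}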

I should have suggested this earlier, but it is often useful to run cuda-memcheck to check for memory access violations like this.

You have to change the driver timeout settings; this is a Windows feature that prevents faulty drivers from making the system unresponsive. Check the Microsoft page describing how to do that.

You should also check the "timeout" flag setting on your GPU Device. If you have the CUDA SDK installed, I believe the "deviceQuery" app will report this property.
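If you prefer to check it programmatically rather than through deviceQuery, a minimal sketch using the runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // kernelExecTimeoutEnabled corresponds to the "run time limit on kernels"
    // line reported by deviceQuery; it is typically set on display (WDDM) GPUs.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("Run time limit on kernels: %s\n",
           prop.kernelExecTimeoutEnabled ? "Yes" : "No");
    return 0;
}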
