Difference on creating a CUDA context

Submitted by 烈酒焚心 on 2019-12-02 01:07:58

Question


I have a program that uses three kernels. In order to get the speedups, I was doing a dummy memory write (a trivial kernel) to create a context first, as follows:

__global__ void warmStart(int* f)
{
    *f = 0;
}

which is launched before the kernels I want to time as follows:

int *dFlag = NULL;
cudaMalloc( (void**)&dFlag, sizeof(int) );
warmStart<<<1, 1>>>(dFlag);
Check_CUDA_Error("warmStart kernel");

I also read about other, simpler ways to create a context, such as calling cudaFree(0) or cudaDeviceSynchronize(). But using these API calls gives worse times than using the dummy kernel.

The execution times of the program, after forcing the context, are 0.000031 seconds for the dummy kernel and 0.000064 seconds for both cudaDeviceSynchronize() and cudaFree(0). The times were obtained as the mean of 10 individual executions of the program.
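A minimal timing harness along these lines (a sketch, assuming a CUDA-capable machine; the `timeInSeconds` helper is hypothetical, a thin wrapper around `gettimeofday`) might look like:

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void warmStart(int *f)
{
    *f = 0;
}

// Hypothetical helper: wall-clock time in seconds via gettimeofday.
static double timeInSeconds()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main()
{
    // Variant A: force context creation with a dummy kernel launch.
    // (Variant B would instead call cudaFree(0) here.)
    int *dFlag = NULL;
    cudaMalloc((void **)&dFlag, sizeof(int));
    warmStart<<<1, 1>>>(dFlag);
    cudaDeviceSynchronize();

    // Time the kernels of interest only after the warm-up, so context
    // creation cost is excluded from the measurement.
    double t0 = timeInSeconds();
    warmStart<<<1, 1>>>(dFlag);  // stand-in for the real kernels
    cudaDeviceSynchronize();
    double t1 = timeInSeconds();

    printf("timed kernel took %f s\n", t1 - t0);

    cudaFree(dFlag);
    return 0;
}
```

Averaging the printed time over several runs, as in the question, smooths out launch-to-launch jitter.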

Therefore, the conclusion I've reached is that launching a kernel initializes something that is not initialized when creating a context in the canonical way.

So, what is the difference between creating a context in these two ways: using a kernel versus using an API call?

I ran the test on a GTX 480, using CUDA 4.0 under Linux.


Answer 1:


Each CUDA context has memory allocations that are required to execute a kernel but that are not required to synchronize, allocate memory, or free memory. The initial allocation of this context memory, and any resizing of these allocations, is deferred until a kernel requires the resources. Examples of these allocations include the local memory buffer, the device heap, and the printf heap.
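This suggests a practical warm-up idiom (a sketch, not from the answer itself): launch a trivial kernel and synchronize once at startup, so the deferred per-context allocations happen before any timed section.

```cuda
#include <cuda_runtime.h>

// Empty kernel: launching it is enough to trigger the deferred
// per-context allocations (local memory buffer, device heap,
// printf heap) that cudaFree(0) or cudaDeviceSynchronize()
// alone do not force.
__global__ void noop() {}

// Call once before any timed kernels. The synchronize ensures the
// warm-up launch has fully completed before timing begins.
void warmUpContext()
{
    noop<<<1, 1>>>();
    cudaDeviceSynchronize();
}
```

This matches the question's observation: after a kernel-based warm-up, subsequent kernels start faster than after an API-call-only warm-up.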



Source: https://stackoverflow.com/questions/13313930/difference-on-creating-a-cuda-context
