Difference on creating a CUDA context

Submitted by 烈酒焚心 on 2019-12-02 01:07:58

Question


I have a program that uses three kernels. In order to get the speedups, I was doing a dummy memory write (a trivial kernel) to create a context first, as follows:

__global__ void warmStart(int* f)
{
    *f = 0;
}

which is launched before the kernels I want to time as follows:

int *dFlag = NULL;
cudaMalloc( (void**)&dFlag, sizeof(int) );
warmStart<<<1, 1>>>(dFlag);
Check_CUDA_Error("warmStart kernel");

I also read about other, simpler ways to create a context, such as calling cudaFree(0) or cudaDeviceSynchronize(). But using these API calls gives worse times than using the dummy kernel.

The execution times of the program, after forcing the context, are 0.000031 seconds for the dummy kernel and 0.000064 seconds for both cudaDeviceSynchronize() and cudaFree(0). The times were obtained as the mean of 10 individual executions of the program.
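A minimal timing harness along these lines (a sketch, assuming a CUDA-capable machine; the `timeInSeconds` helper is hypothetical, a thin wrapper around `gettimeofday`) might look like:

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void warmStart(int *f)
{
    *f = 0;
}

// Hypothetical helper: wall-clock time in seconds via gettimeofday.
static double timeInSeconds()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main()
{
    // Variant A: force context creation with a dummy kernel launch.
    // (Variant B would instead call cudaFree(0) here.)
    int *dFlag = NULL;
    cudaMalloc((void **)&dFlag, sizeof(int));
    warmStart<<<1, 1>>>(dFlag);
    cudaDeviceSynchronize();

    // Time the kernels of interest only after the warm-up, so context
    // creation cost is excluded from the measurement.
    double t0 = timeInSeconds();
    warmStart<<<1, 1>>>(dFlag);  // stand-in for the real kernels
    cudaDeviceSynchronize();
    double t1 = timeInSeconds();

    printf("timed kernel took %f s\n", t1 - t0);

    cudaFree(dFlag);
    return 0;
}
```

Averaging the printed time over several runs, as in the question, smooths out launch-to-launch jitter.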

Therefore, the conclusion I've reached is that launching a kernel initializes something that is not initialized when creating a context in the canonical way.

So, what is the difference between creating a context in these two ways: using a kernel versus using an API call?

I ran the test on a GTX 480, using CUDA 4.0 under Linux.


Answer 1:


Each CUDA context has memory allocations that are required to execute a kernel but that are not required to synchronize, allocate memory, or free memory. The initial allocation of this context memory, and any resizing of these allocations, is deferred until a kernel requires the resources. Examples of these allocations include the local memory buffer, the device heap, and the printf heap.
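This suggests a practical warm-up idiom (a sketch, not from the answer itself): launch a trivial kernel and synchronize once at startup, so the deferred per-context allocations happen before any timed section.

```cuda
#include <cuda_runtime.h>

// Empty kernel: launching it is enough to trigger the deferred
// per-context allocations (local memory buffer, device heap,
// printf heap) that cudaFree(0) or cudaDeviceSynchronize()
// alone do not force.
__global__ void noop() {}

// Call once before any timed kernels. The synchronize ensures the
// warm-up launch has fully completed before timing begins.
void warmUpContext()
{
    noop<<<1, 1>>>();
    cudaDeviceSynchronize();
}
```

This matches the question's observation: after a kernel-based warm-up, subsequent kernels start faster than after an API-call-only warm-up.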



Source: https://stackoverflow.com/questions/13313930/difference-on-creating-a-cuda-context
