slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice

Posted by 江枫思渺然 on 2019-11-28 02:17:07

1. Why does the K40 take much longer (1100 ms vs. 150 ms) for the first cudaMalloc? The K40 is supposed to be better than the K20.

The details of the initialization process are not specified; however, by observation, the amount of system memory affects initialization time. CUDA initialization usually includes establishment of UVM, which involves harmonizing the device and host memory maps. If your server has more system memory than your PC, that is one possible explanation for the disparity in initialization time. The OS may have an effect as well. Finally, the memory size of the GPU may have an effect.

2. I thought cudaSetDevice would capture the initialization time (e.g. this answer from talonmies)?

The CUDA initialization process is a "lazy" initialization. That means that just enough of the initialization process will be completed in order to support the requested operation. If the requested operation is cudaSetDevice, this may require less of the initialization to be complete (which means the apparent time required may be shorter) than if the requested operation is cudaMalloc. That means that some of the initialization overhead may be absorbed into the cudaSetDevice operation, while some additional initialization overhead may be absorbed into a subsequent cudaMalloc operation.
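One way to see how the lazy initialization cost is split is to time each call separately. The following is a minimal sketch, not a definitive benchmark: the helper names time_ms and check are hypothetical, device 0 and the 1 MiB allocation size are arbitrary assumptions, and how the cost is divided between the calls may differ across CUDA versions, drivers, and systems.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time of one callable, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: stop with a readable message if a runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(1);
    }
}

int main() {
    void* a = nullptr;
    void* b = nullptr;
    const size_t bytes = 1 << 20;  // 1 MiB, an arbitrary size for illustration

    // Each call is timed on its own so that any initialization work
    // absorbed by that call shows up in its elapsed time.
    double t_set    = time_ms([&] { check(cudaSetDevice(0), "cudaSetDevice"); });
    double t_first  = time_ms([&] { check(cudaMalloc(&a, bytes), "first cudaMalloc"); });
    double t_second = time_ms([&] { check(cudaMalloc(&b, bytes), "second cudaMalloc"); });

    std::printf("cudaSetDevice:     %.3f ms\n", t_set);
    std::printf("first cudaMalloc:  %.3f ms\n", t_first);
    std::printf("second cudaMalloc: %.3f ms\n", t_second);

    check(cudaFree(a), "cudaFree(a)");
    check(cudaFree(b), "cudaFree(b)");
    return 0;
}
```

If the first cudaMalloc is much slower than the second, the difference is a rough estimate of the initialization work that cudaSetDevice did not absorb.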

3. If the initialization is unavoidable, can process A keep its state (or context) on the GPU while process B is running on the same GPU? I understand I had better run the GPU in "exclusive" mode, but can process A "suspend" so that it doesn't need to initialize the GPU again later?

Independent host processes will generally spawn independent CUDA contexts. Each CUDA context has its own initialization requirement, so the fact that another, separate CUDA context is already initialized on the device will not provide much benefit if a new CUDA context needs to be initialized (perhaps from a separate host process). Normally, keeping a process active involves keeping an application running in that process. Applications have various mechanisms to "sleep" or suspend their behavior. As long as the application has not terminated, any context established by that application should not require re-initialization (except, perhaps, if cudaDeviceReset is called).

In general, some benefit may be obtained on systems that allow the GPUs to go into a deep idle mode by setting GPU persistence mode (using nvidia-smi, for example nvidia-smi -pm 1). However, this will not be relevant for GeForce GPUs, nor will it generally be relevant on a Windows system.

Additionally, on multi-GPU systems, if the application does not need multiple GPUs, some initialization time can usually be avoided by using the CUDA_VISIBLE_DEVICES environment variable, to restrict the CUDA runtime to only use the necessary devices.
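The variable is normally set in the shell that launches the program (for example, CUDA_VISIBLE_DEVICES=0 ./app, where ./app is a placeholder name). The sketch below shows an in-process alternative under some assumptions: a POSIX system with setenv, that the variable is set before the first CUDA call in the process, and that device index "0" is the GPU you want. The helper name restrict_visible_devices is hypothetical.

```cuda
#include <cstdio>
#include <stdlib.h>   // setenv (POSIX)
#include <cuda_runtime.h>

// Hypothetical helper: restrict which GPUs the CUDA runtime will enumerate.
// Assumption: this runs before any CUDA call, because the variable is
// read when CUDA initializes and later changes are not picked up.
bool restrict_visible_devices(const char* ids) {
    return setenv("CUDA_VISIBLE_DEVICES", ids, 1) == 0;
}

int main() {
    if (!restrict_visible_devices("0")) {
        std::fprintf(stderr, "could not set CUDA_VISIBLE_DEVICES\n");
        return 1;
    }

    // First CUDA call in the process: only the listed devices are visible.
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("visible CUDA devices: %d\n", count);
    return 0;
}
```

Setting the variable from the launching shell is usually the simpler choice, since it does not depend on call ordering inside the program.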

Depending on the architecture the code was compiled for and the architecture of the GPU running it, JIT compilation can kick in at the first cudaMalloc (or any other CUDA) call. "If binary code is not found but PTX is available, then the driver compiles the PTX code." Some more details:

http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/
