Calculation on GPU leads to driver error “stopped responding”

Submitted by 北城以北 on 2019-11-28 02:10:42
talonmies

The key piece of information is this part of the gpuDevice output:

KernelExecutionTimeout: 1

This means that the host display driver is active on the GPU you are running the compute jobs on. The NVIDIA display driver contains a watchdog timer which kills any task that takes more than a predefined amount of time without yielding control back to the driver for screen refresh. This is intended to prevent a long-running or stuck compute job from rendering the machine unresponsive by freezing the display. Some operation in your MATLAB script is clearly running longer than the display driver watchdog timer limit. Once that happens, the compute context held on the device is destroyed and MATLAB can no longer operate with the device. You might be able to reinitialise the context by calling reset, which I guess will run cudaDeviceReset() under the hood.

There is a lot of information about this watchdog timer on the web, for example this Stack Overflow question. How to modify the timeout depends on your OS and hardware. The simplest ways to avoid the problem are to not run CUDA code on a display GPU, or to split your compute jobs into smaller pieces so that no single operation has a runtime which exceeds the timeout limit. Or just write faster code...
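
To illustrate the idea of splitting a job so that no single operation runs for too long, here is a minimal sketch in Python. It is not MATLAB or CUDA code: the names chunk_ranges, run_in_chunks and process_chunk are hypothetical, and process_chunk only stands in for whatever single GPU operation you would actually launch on each slice. The right chunk size depends on your hardware and timeout setting and would have to be found by experiment.

```python
from typing import Callable, Iterator, List, Sequence


def chunk_ranges(total: int, chunk_size: int) -> Iterator[range]:
    """Yield consecutive index ranges covering 0..total-1, each at most chunk_size long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total, chunk_size):
        yield range(start, min(start + chunk_size, total))


def run_in_chunks(
    data: Sequence[float],
    chunk_size: int,
    process_chunk: Callable[[Sequence[float]], List[float]],
) -> List[float]:
    """Apply process_chunk to successive slices of data and concatenate the results.

    process_chunk is a stand-in (an assumption) for one short GPU operation.
    Each call handles at most chunk_size elements, so its runtime is bounded
    by the slice size rather than by the size of the whole data set.
    """
    results: List[float] = []
    for r in chunk_ranges(len(data), chunk_size):
        results.extend(process_chunk(data[r.start:r.stop]))
    return results


if __name__ == "__main__":
    # Hypothetical workload: square every element, 4 elements per call.
    squared = run_in_chunks(list(range(10)), 4, lambda xs: [x * x for x in xs])
    print(squared)
```

In a real GPU program each call would also need to finish (for example by synchronising or copying the result back to the host) before the next one is issued. Otherwise the pieces could still queue up back to back on the device.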
