How to reduce CUDA synchronize latency / delay

不想你离开。 提交于 2019-12-07 02:47:05

问题


This question is related to using cuda streams to run many kernels

In CUDA there are many synchronization commands cudaStreamSynchronize, CudaDeviceSynchronize, cudaThreadSynchronize, and also cudaStreamQuery to check if streams are empty.

I noticed when using the profiler that these synchronize commands introduce a large delay to the program. I was wondering if anyone knows any means to reduce this latency apart from of course using as few synchronisation commands as possible.

Also is there any figures to judge the most effecient synchronisation method. that is consider 3 streams used in an application and two of them need to complete for me to launch a forth streams should i use 2 cudaStreamSyncs or just one cudaDeviceSync what will incur less loss ?


回答1:


The main difference between synchronize methods is "polling" and "blocking."

"Polling" is the default mechanism for the driver to wait for the GPU - it waits for a 32-bit memory location to attain a certain value written by the GPU. It may return the wait more quickly after the wait is resolved, but while waiting, it burns a CPU core looking at that memory location.

"Blocking" can be requested by calling cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, or calling cudaEventCreate() with cudaEventBlockingSync. Blocking waits cause the driver to insert a command into the DMA command buffer that signals an interrupt when all preceding commands in the buffer have been executed. The driver can then map the interrupt to a Windows event or a Linux file handle, enabling the synchronization commands to wait without constantly burning CPU, as do the default polling methods.

The queries are basically a manual check of that 32-bit memory location used for polling waits; so in most situations, they are very cheap. But if ECC is enabled, the query will dive into kernel mode to check if there are any ECC errors; and on Windows, any pending commands will be flushed to the driver (which requires a kernel thunk).



来源:https://stackoverflow.com/questions/11953722/how-to-reduce-cuda-synchronize-latency-delay

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!