I'm trying to explore the concurrent kernel execution feature of my NVIDIA Quadro 4000, which has compute capability 2.0.
I use two different streams that run the same kernel.
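Roughly, the setup looks like this (a minimal sketch; the kernel body, sizes, and names are placeholders rather than my exact code):

```
#include <cuda_runtime.h>

__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // placeholder work
}

int main()
{
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Same kernel launched into two different non-default streams.
    kernel<<<n / 256, 256, 0, s0>>>(d_a, n);
    kernel<<<n / 256, 256, 0, s1>>>(d_b, n);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```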
Compute Capability 2.x-3.0
Compute capability 2.x-3.0 devices have a single hardware work queue. The CUDA driver pushes commands into the work queue. The GPU front end reads the commands and dispatches the work to the copy engines or the CUDA Work Distributor (CWD). The CUDA driver inserts synchronization commands into the hardware work queue to guarantee that work in the same stream cannot run concurrently. When the front end hits a synchronization command, it stalls until the dependent work has completed.
Concurrent kernel execution improves GPU utilization when a grid is too small to fill the entire GPU, or when grids have a tail effect (a subset of thread blocks runs much longer than the rest).
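For a rough sense of scale: the Quadro 4000 in the question has 8 SMs, so a grid of only a few thread blocks leaves most of the device idle unless another grid can run alongside it. A minimal sketch (the kernel here is hypothetical):

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void tiny_kernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float *d;
    cudaMalloc(&d, 4 * 128 * sizeof(float));

    // Only 4 thread blocks: on a device with more SMs than blocks, the
    // remaining SMs sit idle for the life of this grid unless concurrent
    // kernels can fill them.
    tiny_kernel<<<4, 128>>>(d);
    cudaDeviceSynchronize();

    printf("blocks launched: 4, SMs available: %d\n", prop.multiProcessorCount);

    cudaFree(d);
    return 0;
}
```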
Case 1: Back-to-back kernels on one stream
If an application launches two kernels back to back on the same stream, the synchronization command inserted by the CUDA driver will not dispatch the second kernel to CWD until the first kernel has completed.
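Reusing the names from the sketch in the question (a fragment, not a standalone program), case 1 looks like:

```
// Both launches go into the same stream s0. The driver-inserted
// synchronization command keeps the second launch out of CWD until
// the first kernel has fully completed, so the two never overlap.
kernel<<<n / 256, 256, 0, s0>>>(d_a, n);
kernel<<<n / 256, 256, 0, s0>>>(d_b, n);
```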
Case 2: Back-to-back kernel launches on two streams
If an application launches two kernels on different streams, the front end reads the commands and dispatches them to CWD. CWD rasterizes the first grid (the order is architecture dependent) and dispatches thread blocks to the SMs. Only when all of the thread blocks from the first grid have been dispatched does CWD dispatch thread blocks from the second grid.
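And case 2, again as a fragment reusing the question's names:

```
// Launches in two different streams s0 and s1. No synchronization
// command separates them, so CWD may begin dispatching blocks of the
// second grid as soon as every block of the first has been dispatched.
kernel<<<n / 256, 256, 0, s0>>>(d_a, n);
kernel<<<n / 256, 256, 0, s1>>>(d_b, n);
```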
Compute Capability 3.5
Compute capability 3.5 introduced several new features to improve GPU utilization. These include:

- Hyper-Q supports multiple independent hardware work queues.
- Dynamic Parallelism allows device code to launch new work (see the sketch below).
- CWD capacity was increased to 32 grids.
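As an illustration of the second point, here is a minimal Dynamic Parallelism sketch (the kernels are hypothetical; it needs a compute capability 3.5 device and compilation with nvcc -arch=sm_35 -rdc=true):

```
#include <cuda_runtime.h>

__global__ void child(float *p)
{
    p[threadIdx.x] += 1.0f;
}

__global__ void parent(float *p)
{
    // Device-side launch: one thread enqueues a child grid. The parent
    // grid is not complete until its child grids have finished.
    if (threadIdx.x == 0)
        child<<<1, 256>>>(p);
}

int main()
{
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));

    parent<<<1, 32>>>(d);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```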