CUDA: What is the threads per multiprocessor and threads per block distinction? [duplicate]


Given that only one block can be executed on each SM at a time,

This statement is fundamentally incorrect. Barring resource conflicts, and assuming enough threadblocks in a kernel (i.e. the grid), an SM will generally have multiple threadblocks assigned to it.
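To illustrate (this sketch is not part of the original answer), the following launches a hypothetical kernel with several times more blocks than the device has SMs and has each block report, via the `%smid` PTX special register, which SM it actually ran on. With enough blocks in the grid, the output shows multiple blocks mapped to the same SM:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Thread 0 of each block reports which SM the block executed on.
__global__ void where_am_i()
{
    if (threadIdx.x == 0) {
        unsigned smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));  // PTX special register: id of the current SM
        printf("block %d ran on SM %u\n", blockIdx.x, smid);
    }
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Launch several times more blocks than the device has SMs.
    int numBlocks = prop.multiProcessorCount * 4;
    where_am_i<<<numBlocks, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```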

The basic unit of execution is the warp. A warp consists of 32 threads, executed together in lockstep by an SM, on an instruction-cycle by instruction-cycle basis.
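For example, here is a minimal sketch (the kernel name and launch configuration are just illustrative) showing how a block's threads decompose into warps of `warpSize` (32) threads each:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A 256-thread block decomposes into 8 warps of warpSize (32) threads each.
__global__ void warp_ids()
{
    int lane = threadIdx.x % warpSize;  // lane: position within the warp (0..31)
    int warp = threadIdx.x / warpSize;  // warp index within the block
    if (lane == 0)
        printf("block %d: warp %d starts at thread %d\n", blockIdx.x, warp, threadIdx.x);
}

int main()
{
    warp_ids<<<2, 256>>>();   // 2 blocks of 256 threads = 8 warps per block
    cudaDeviceSynchronize();
    return 0;
}
```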

Therefore, even within a single threadblock, an SM will generally have more than a single warp "in flight". This is essential for good performance, because it allows the machine to hide latency.

There is no conceptual difference between choosing warps from the same threadblock to execute, or warps from different threadblocks. SMs can have multiple threadblocks resident on them (i.e. with resources such as registers and shared memory assigned to each resident threadblock), and the warp scheduler will choose from amongst all the warps in all the resident threadblocks, to select the next warp for execution on any given instruction cycle.

Therefore, the SM can have a greater number of threads "resident" than a single block can contain, because it supports more than one resident block, even if each block is maximally configured with threads (512, in this case). We exceed the per-block thread limit by having multiple threadblocks resident.
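A small sketch querying the device properties makes the distinction explicit: `maxThreadsPerBlock` and `maxThreadsPerMultiProcessor` are separate limits, and the per-SM limit is typically a multiple of the per-block limit:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // The per-block limit and the per-SM limit are separate hardware properties.
    printf("max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    // How many maximally sized blocks it takes to fill the SM's thread capacity:
    printf("maximally sized blocks per SM:  %d\n",
           prop.maxThreadsPerMultiProcessor / prop.maxThreadsPerBlock);
    return 0;
}
```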

You may also want to research the idea of occupancy in GPU programs.
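As a starting point, the runtime's occupancy calculator can report how many blocks of a given kernel can be resident per SM. The kernel and block size below are made-up placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to feed the occupancy calculator.
__global__ void my_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    const int blockSize = 256;   // assumed block size for the launch
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of my_kernel can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel,
                                                  blockSize, 0 /* dynamic shared mem */);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    printf("occupancy: %d of %d warps per SM\n",
           blocksPerSM * blockSize / prop.warpSize,
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```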
