cuda shared memory and block execution scheduling

Submitted by 霸气de小男生 on 2019-12-14 04:00:00

Question


I would like to clarify how CUDA schedules block execution when the amount of shared memory used per block limits how many blocks can be resident on a streaming multiprocessor (SM).

State

I am targeting the GTX480 NVIDIA card, which has 48KB of shared memory per SM and 15 streaming multiprocessors. So if I launch a kernel with 15 blocks, each using the full 48KB of shared memory, and no other limit is reached (registers, maximum threads per block, etc.), then each block runs on one of the 15 SMs until it finishes. In this case only scheduling between warps of the same block is needed.

Question

So the scenario I am unsure about is this:
I launch a kernel with 30 blocks, so that 2 blocks are assigned to each SM. Now the scheduler on each SM has to deal with warps from different blocks. But because each block uses the entire 48KB of shared memory available on the SM, the warps of one block can only execute after the other block has finished. If this were not the case, and warps from both blocks were scheduled on the same SM concurrently, the results could be wrong, because one block could read values that the other block had loaded into shared memory. Am I right?


Answer 1:


You don't need to worry about this. As you have correctly said, if only one block fits per SM because of the amount of shared memory used, only one block will be scheduled at any one time. So there is no chance of memory corruption caused by overcommitting shared memory.


BTW for performance reasons it is usually better to have at least two blocks running per SM because

  • during __syncthreads() the SM may idle unnecessarily, as fewer and fewer warps from the block may still be runnable.
  • warps of the same block tend to run tightly coupled, so there may be times when all warps wait for memory and other times when all warps perform computations. With more blocks this may even out, resulting in better overall resource utilization.

Of course there may be reasons why more shared memory per block gives a larger speedup than running multiple blocks per SM would.



Source: https://stackoverflow.com/questions/12651939/cuda-shared-memory-and-block-execution-scheduling
