I get what blockDim is, but I have a problem with gridDim. blockDim gives the size of the block, but what is gridDim? On the Internet
Paraphrased from the CUDA Programming Guide:
gridDim: This variable contains the dimensions of the grid.
blockIdx: This variable contains the block index within the grid.
blockDim: This variable contains the dimensions of the block.
threadIdx: This variable contains the thread index within the block.
You seem to be a bit confused about the thread hierarchy that CUDA has. In a nutshell, each kernel launch has one grid (which I always visualize as a 3-dimensional cube). Each of its elements is a block, such that a grid declared as dim3 grid(10, 10, 2);
would have 10*10*2 = 200 total blocks. In turn, each block is a 3-dimensional cube of threads.
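To make the grid-of-blocks picture concrete, here is a minimal sketch that launches the 10 x 10 x 2 grid from above and counts how many blocks actually ran. The kernel name `mark_blocks`, the host helper `count_blocks`, and the 4 x 4 x 2 block shape are my own illustrative choices, not anything from your code.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each block flattens its 3D blockIdx into a single linear ID using gridDim,
// and one thread per block records that the block ran.
__global__ void mark_blocks(int *seen)
{
    int block_id = blockIdx.x
                 + blockIdx.y * gridDim.x
                 + blockIdx.z * gridDim.x * gridDim.y;

    // Only the first thread of each block writes, to avoid redundant stores.
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
        seen[block_id] = 1;
}

// Hypothetical host helper: returns the number of blocks that reported in,
// or -1 if a CUDA call failed.
int count_blocks()
{
    dim3 grid(10, 10, 2);   // 10*10*2 = 200 blocks
    dim3 block(4, 4, 2);    // arbitrary small 3D block (32 threads)
    const int total = static_cast<int>(grid.x * grid.y * grid.z);
    const size_t bytes = total * sizeof(int);

    int *d_seen = nullptr;
    if (cudaMalloc(&d_seen, bytes) != cudaSuccess) return -1;
    cudaMemset(d_seen, 0, bytes);

    mark_blocks<<<grid, block>>>(d_seen);

    // cudaMemcpy waits for the kernel on the default stream to finish.
    std::vector<int> h_seen(total, 0);
    cudaError_t err = cudaMemcpy(h_seen.data(), d_seen, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_seen);
    if (err != cudaSuccess) return -1;

    int count = 0;
    for (int i = 0; i < total; ++i) count += h_seen[i];
    return count;
}
```

Calling `count_blocks()` from a `main` function should give the product of the three grid dimensions, since every block writes to a distinct slot.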
With that said, it's common to use only the x-dimension of the blocks and grids, which is what the code in your question appears to be doing. This is especially relevant if you're working with 1D arrays. In that case, the blockDim.x * gridDim.x term in your tid += blockDim.x * gridDim.x line is the total number of threads in the grid, because blockDim.x is the size of each block and gridDim.x is the total number of blocks.
So if you launch a kernel with parameters
dim3 block_dim(128,1,1);
dim3 grid_dim(10,1,1);
kernel<<<grid_dim, block_dim>>>(...);
then in your kernel, where you compute threadIdx.x + blockIdx.x*blockDim.x, you would effectively have:
threadIdx.x ranging over [0, 128)
blockIdx.x ranging over [0, 10)
blockDim.x equal to 128
gridDim.x equal to 10
Hence, in calculating threadIdx.x + blockIdx.x*blockDim.x, you would have values within the range defined by [0, 128) + 128 * [0, 10), which means your tid values would range over {0, 1, 2, ..., 1279}.
This is useful when you want to map threads to tasks, as it provides a unique identifier for every thread in your kernel.
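As a quick check of the range above, here is a minimal sketch in which every thread stores its own global index into an array, using the same 10 blocks of 128 threads. The names `write_global_index` and `global_indices_are_unique` are hypothetical, chosen only for this illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread computes its global 1D index and writes it to its own slot.
__global__ void write_global_index(int *out, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)            // guard in case the grid is larger than the array
        out[tid] = tid;
}

// Hypothetical host helper: returns true if slot i holds exactly i for every
// i in [0, 1280), i.e. each thread got a distinct index covering the range.
bool global_indices_are_unique()
{
    dim3 block_dim(128, 1, 1);
    dim3 grid_dim(10, 1, 1);
    const int n = static_cast<int>(block_dim.x * grid_dim.x);  // 1280
    const size_t bytes = n * sizeof(int);

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return false;
    cudaMemset(d_out, 0xFF, bytes);  // every int starts as -1 (all bits set)

    write_global_index<<<grid_dim, block_dim>>>(d_out, n);

    std::vector<int> h_out(n, 0);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != i) return false;  // a missing or wrong index
    return true;
}
```

If any slot were left at -1 or held another thread's index, the helper would return false, so a true result means the indices are unique and cover the whole range.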
However, if you have
int tid = threadIdx.x + blockIdx.x * blockDim.x;
tid += blockDim.x * gridDim.x;
then you'll essentially have tid = [0, 128) + 128 * [0, 10) + (128 * 10), and your tid values would range over {1280, 1281, ..., 2559}.
I'm not sure where that would be relevant, but it all depends on your application and how you map your threads to your data. This mapping is central to any kernel launch, and you're the one who determines how it should be done. When you launch your kernel you specify the grid and block dimensions, and you're the one who has to enforce the mapping to your data inside your kernel. That works as long as you don't exceed your hardware limits (at most 2^10 threads per block; per grid dimension, at most 2^31 - 1 blocks in x on compute capability 3.0 and later, and 2^16 - 1 blocks in y and z).
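One place where a tid += blockDim.x * gridDim.x increment commonly shows up is a grid-stride loop, where each thread handles several elements spaced one full grid apart, so a fixed-size grid can cover an array of any length. I can't tell from the snippet whether that is what your code does, so treat this as a sketch under that assumption. The names `add_one` and `grid_stride_covers_all` are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global index and then jumps
// ahead by the total number of threads in the grid until it runs past n.
__global__ void add_one(float *data, int n)
{
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < n;
         tid += blockDim.x * gridDim.x)
    {
        data[tid] += 1.0f;
    }
}

// Hypothetical host helper: runs add_one over n zeros with 10 blocks of
// 128 threads and returns true if every element was incremented exactly once.
bool grid_stride_covers_all(int n)
{
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 0.0f);

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;
    cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice);

    // 1280 threads in total, regardless of n.
    add_one<<<10, 128>>>(d_data, n);

    cudaError_t err = cudaMemcpy(host.data(), d_data, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != 1.0f) return false;
    return true;
}
```

With n = 2560, each thread touches two elements: its first pass covers indices 0 to 1279 and the second covers 1280 to 2559, which matches the range worked out above.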