A question about the details about the distribution from blocks to SMs in CUDA

泪湿孤枕 提交于 2019-12-06 11:58:17

On the GT200, it has been demonstrated through micro-benchmarking that new blocks are scheduled whenever a SM has retired all the currently active blocks which it was running. So the answer is when some blocks are finished, and the scheduling granularity is SM level. There seems to be a consensus that Fermi GPUs have a finer scheduling granularity than previous generations of hardware.

I can't find any reference about this for compute capabilities < 1.3.

Fermi architectures introduce a new block dispatcher called GigaThread engine.
GigaThread enables immediate replacement of blocks on an SM when one completes executing and also enables concurrent kernel execution.

While there is no official answer to this, you can measure through atomic operations when your blocks begin your work and when they end.

Try playing with the following code:

#include <stdio.h>

const int maxBlocks=60; //Number of blocks of size 512 threads on current device required to achieve full occupancy

__global__ void emptyKernel() {}


__global__ void myKernel(int *control, int *output) {
        if (threadIdx.x==1) {
                //register that we enter
                int enter=atomicAdd(control,1);
                output[blockIdx.x]=enter;

                //some intensive and long task
                int &var=output[blockIdx.x+gridDim.x]; //var references global memory
                var=1;
                for (int i=0; i<12345678; ++i) {
                        var+=1+tanhf(var);
                }

                //register that we quit
                var=atomicAdd(control,1);
        }
}


int main() {

        int *gpuControl;
        cudaMalloc((void**)&gpuControl, sizeof(int));
        int cpuControl=0;
        cudaMemcpy(gpuControl,&cpuControl,sizeof(int),cudaMemcpyHostToDevice);


        int *gpuOutput;
        cudaMalloc((void**)&gpuOutput, sizeof(int)*maxBlocks*2);
        int cpuOutput[maxBlocks*2];

        for (int i=0; i<maxBlocks*2; ++i) //clear the host array just to be on the safe side
                cpuOutput[i]=-1;

        // play with these values
        const int thr=479;
        const int p=13;
        const int q=maxBlocks;

        //I found that this may actually affect the scheduler! Try with and without this call.
        emptyKernel<<<p,thr>>>();

        cudaEvent_t timerStart;
        cudaEvent_t timerStop;
        cudaEventCreate(&timerStart);
        cudaEventCreate(&timerStop);

        cudaThreadSynchronize();

        cudaEventRecord(timerStart,0);

        myKernel<<<q,512>>>(gpuControl, gpuOutput);

        cudaEventRecord(timerStop,0);
        cudaEventSynchronize(timerStop);

        cudaMemcpy(cpuOutput,gpuOutput,sizeof(int)*maxBlocks*2,cudaMemcpyDeviceToHost);

        cudaThreadSynchronize();
        float thisTime;
        cudaEventElapsedTime(&thisTime,timerStart,timerStop);

        cudaEventDestroy(timerStart);
        cudaEventDestroy(timerStop);
        printf("Elapsed time: %f\n",thisTime);

        for (int i=0; i<q; ++i)
                printf("%d: %d-%d\n",i,cpuOutput[i],cpuOutput[i+q]);
}

What you get in the output is the block ID, followed by the enter "time" and exit "time". This way you can learn in which order those events occured.

On Fermi, I'm sure that a block is scheduled on a SM as soon there is room for it. I.e., whenever, a SM finishes executing one block, it will execute another block if there is any block left. (However, the actual order is not deterministic).

In older versions, I don't know. But you can verify it by using the build-in clock() function.

For example, I used the following OpenCL kernel code (you can easily convert it to CUDA):

   __kernel void test(uint* start, uint* end, float* buffer);
   {
       int id = get_global_id(0);
       start[id] = clock();
       __do_something_here;
       end[id] = clock();
   }

Then output it to a file and build a graph. You will see how visual it is.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!