How do I know if the kernels are executing concurrently?

栀梦 2020-12-15 14:23

I have a GPU with CC 3.0, so it should support 16 concurrent kernels. I am starting 10 kernels by calling clEnqueueNDRangeKernel 10 times in a loop. How do I get to know t

3 Answers
  • 2020-12-15 15:02

    You can avoid the boilerplate code suggested in the other answers (which are correct, by the way) by using C Framework for OpenCL. It simplifies this task considerably and gives you detailed information about OpenCL events (kernel execution, data transfers, etc.), including a table and a plot dedicated to overlapped execution of those events.

    I developed this library to simplify, among other things, the process described in the other answers. You can see a basic usage example here.

  • 2020-12-15 15:04

    To determine if your kernel executions overlap, you have to profile them. This requires several steps:

    1. Creating the command-queues

    Profiling data is only collected if the command-queue is created with the property CL_QUEUE_PROFILING_ENABLE:

    cl_int errcode;
    cl_command_queue queues[10];
    for (int i = 0; i < 10; ++i) {
      // One profiling-enabled queue per kernel
      queues[i] = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE,
                                       &errcode);
    }
    

    2. Making sure all kernels start at the same time

    You are right in your assumption that the CPU queues the kernels sequentially. However, you can create a single user event and add it to the wait list of every kernel. This prevents any kernel from starting until the user event is completed:

    // Create the user event
    cl_event user_event = clCreateUserEvent(context, &errcode);
    
    // Reserve space for kernel events
    cl_event kernel_events[10];
    
    // Enqueue kernels, each one waiting on the user event
    for (int i = 0; i < 10; ++i) {
      clEnqueueNDRangeKernel(queues[i], kernel, work_dim, global_work_offset,
                             global_work_size, local_work_size, 1, &user_event,
                             &kernel_events[i]);
    }
    
    // Start all kernels by completing the user event
    clSetUserEventStatus(user_event, CL_COMPLETE);
    

    3. Obtaining profiling times

    Finally, we can collect the timing information for the kernel events:

    // Block until all kernels have run to completion
    clWaitForEvents(10, kernel_events);
    
    for (int i = 0; i < 10; ++i) {
      cl_ulong start;
      clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_START,
                              sizeof(start), &start, NULL);
      cl_ulong end;
      clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_END,
                              sizeof(end), &end, NULL);
      // cl_ulong is a 64-bit unsigned integer; cast to match %llu portably
      printf("Event %d: start=%llu, end=%llu\n", i, (unsigned long long)start,
             (unsigned long long)end);
    }
    

    4. Analyzing the output

    Now that you have the start and end times of all kernel runs, you can check for overlaps (either by hand or programmatically). The output units are nanoseconds. Note however that the device timer is only accurate to a certain resolution. You can query the resolution using:

    size_t resolution;
    clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                    sizeof(resolution), &resolution, NULL);
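
    The programmatic overlap check can be sketched in plain C, as below. This is a minimal sketch under stated assumptions, not part of the OpenCL API: the names kernel_span, spans_overlap and count_overlapping_pairs are hypothetical, and the start and end values are assumed to have been copied from the cl_ulong results of step 3 into uint64_t fields. The tolerance_ns parameter is likewise an assumption; the timer resolution queried above is one plausible value for it.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* One kernel run: profiling timestamps in nanoseconds (device time).
       Hypothetical helper type, filled from the values read in step 3. */
    typedef struct {
      uint64_t start;
      uint64_t end;
    } kernel_span;

    /* Returns 1 if the two runs overlap by more than tolerance_ns, else 0. */
    static int spans_overlap(kernel_span a, kernel_span b, uint64_t tolerance_ns) {
      uint64_t latest_start = a.start > b.start ? a.start : b.start;
      uint64_t earliest_end = a.end < b.end ? a.end : b.end;
      return earliest_end > latest_start &&
             earliest_end - latest_start > tolerance_ns;
    }

    /* Counts the pairs (i, j) with i < j whose runs overlap. */
    static size_t count_overlapping_pairs(const kernel_span *spans, size_t n,
                                          uint64_t tolerance_ns) {
      assert(spans != NULL || n == 0);
      size_t count = 0;
      for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
          if (spans_overlap(spans[i], spans[j], tolerance_ns)) {
            ++count;
          }
        }
      }
      return count;
    }
    ```

    A result of 0 for all 10 kernels would indicate that they ran strictly one after another; any nonzero count means at least two runs shared device time.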
    

    FWIW, I tried this on an NVIDIA device with CC 2.0 (which should support concurrent kernels) and observed that the kernels ran sequentially.

  • 2020-12-15 15:23

    Yes, as you suggest, use the events and analyze the QUEUED, SUBMIT, START, and END values of each one. These should be absolute values in "device time", so you may be able to see whether processing (START to END) overlaps across the different kernels.
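
    One way to read the four values is to split each event into phases, sketched below in plain C. The types event_times and event_phases and the function split_phases are hypothetical names for illustration. The four inputs are assumed to come from clGetEventProfilingInfo with CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, and the sketch also assumes the timestamps are in non-decreasing order, reporting failure otherwise.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* The four profiling timestamps of one event, in nanoseconds. */
    typedef struct {
      uint64_t queued;
      uint64_t submit;
      uint64_t start;
      uint64_t end;
    } event_times;

    /* Durations of the three phases between those timestamps. */
    typedef struct {
      uint64_t queue_wait;  /* QUEUED -> SUBMIT: time spent in the host queue */
      uint64_t submit_wait; /* SUBMIT -> START: time waiting on the device */
      uint64_t execution;   /* START -> END: time actually running */
    } event_phases;

    /* Fills *out and returns 1 if the timestamps are ordered; returns 0 and
       leaves *out untouched if they are not. */
    static int split_phases(event_times t, event_phases *out) {
      assert(out != NULL);
      if (t.queued > t.submit || t.submit > t.start || t.start > t.end) {
        return 0;
      }
      out->queue_wait = t.submit - t.queued;
      out->submit_wait = t.start - t.submit;
      out->execution = t.end - t.start;
      return 1;
    }
    ```

    If submit_wait grows from one kernel to the next by roughly the execution time of the previous kernels, that would suggest the device is running them one after another rather than concurrently.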
