clock() in opencl | 易学教程

问题

I know that there is function clock() in CUDA where you can put in kernel code and query the GPU time. But I wonder if such a thing exists in OpenCL? Is there any way to query the GPU time in OpenCL? (I'm using NVIDIA's tool kit).

回答1:

The NVIDIA OpenCL SDK has an example Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock. %clock is described in PTX: Parallel Thread Execution ISA manual. You should be able to replace the %%laneid with %%clock.

I have never tested this with OpenCL but use it in CUDA.

Please be warned that the compiler may reorder or remove the register read.

回答2:

There is no OpenCL way to query clock cycles directly. However, OpenCL does have a profiling mechanism that exposes incremental counters on compute devices. By comparing the differences between ordered events, elapsed times can be measured. See clGetEventProfilingInfo.

回答3:

Just for others coming her for help: Short introduction to profiling kernel runtime with OpenCL

Enable profiling mode:

cmdQueue = clCreateCommandQueue(context, *devices, CL_QUEUE_PROFILING_ENABLE, &err);

Profiling kernel:

cl_event prof_event; 
clEnqueueNDRangeKernel(cmdQueue, kernel, 1 , 0, globalWorkSize, NULL, 0, NULL, &prof_event);

Read profiling data in:

cl_ulong ev_start_time=(cl_ulong)0;     
cl_ulong ev_end_time=(cl_ulong)0;   

clFinish(cmdQueue);
err = clWaitForEvents(1, &prof_event);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ev_start_time, NULL);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ev_end_time, NULL);

Calculate kernel execution time:

float run_time_gpu = (float)(ev_end_time - ev_start_time)/1000; // in usec

Profiling of individual work-items / work-goups is NOT possible yet. You can set globalWorkSize = localWorkSize for profiling. Then you have only one workgroup.

Btw: Profiling of a single work-item (some work-items) isn't very helpful. With only some work-items you won't be able to hide memory latencies and the overhead leading to not meaningful measurements.

回答4:

Try this (Only work with NVidia OpenCL of course) :

uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time));
    return clock_time;
}

回答5:

On NVIDIA you can use the following:

typedef unsigned long uint64_t; // if you haven't done so earlier
inline uint64_t n_nv_Clock()
{
    uint64_t n_clock;
    asm volatile("mov.u64 %0, %%clock64;" : "=l" (n_clock)); // make sure the compiler will not reorder this
    return n_clock;
}

The volatile keyword tells the optimizer that you really mean it and don't want it moved / optimized away. This is a standard way of doing so both in PTX and e.g. in gcc.

Note that this returns clocks, not nanoseconds. You need to query for device clock frequency (using clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, 0))). Also note that on older devices there are two frequencies (or three if you count the memory frequency which is irrelevant in this case): the device clock and the shader clock. What you want is the shader clock.

With the 64-bit version of the register you don't need to worry about overflowing as it generally takes hundreds of years. On the other hand, the 32-bit version can overflow quite often (you can still recover the result - unless it overflows twice).

来源：https://stackoverflow.com/questions/8849486/clock-in-opencl

标签

cuda

opencl

gpu