clock() in opencl

≡放荡痞女 提交于 2019-12-05 13:52:47

The NVIDIA OpenCL SDK has an example Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock. %clock is described in PTX: Parallel Thread Execution ISA manual. You should be able to replace the %%laneid with %%clock.

I have never tested this with OpenCL but use it in CUDA.

Please be warned that the compiler may reorder or remove the register read.

There is no OpenCL way to query clock cycles directly. However, OpenCL does have a profiling mechanism that exposes incremental counters on compute devices. By comparing the differences between ordered events, elapsed times can be measured. See clGetEventProfilingInfo.

Just for others coming her for help: Short introduction to profiling kernel runtime with OpenCL

Enable profiling mode:

cmdQueue = clCreateCommandQueue(context, *devices, CL_QUEUE_PROFILING_ENABLE, &err);

Profiling kernel:

cl_event prof_event; 
clEnqueueNDRangeKernel(cmdQueue, kernel, 1 , 0, globalWorkSize, NULL, 0, NULL, &prof_event);

Read profiling data in:

cl_ulong ev_start_time=(cl_ulong)0;     
cl_ulong ev_end_time=(cl_ulong)0;   

clFinish(cmdQueue);
err = clWaitForEvents(1, &prof_event);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ev_start_time, NULL);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ev_end_time, NULL);

Calculate kernel execution time:

float run_time_gpu = (float)(ev_end_time - ev_start_time)/1000; // in usec

Profiling of individual work-items / work-goups is NOT possible yet. You can set globalWorkSize = localWorkSize for profiling. Then you have only one workgroup.

Btw: Profiling of a single work-item (some work-items) isn't very helpful. With only some work-items you won't be able to hide memory latencies and the overhead leading to not meaningful measurements.

Try this (Only work with NVidia OpenCL of course) :

uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time));
    return clock_time;
}

On NVIDIA you can use the following:

typedef unsigned long uint64_t; // if you haven't done so earlier
inline uint64_t n_nv_Clock()
{
    uint64_t n_clock;
    asm volatile("mov.u64 %0, %%clock64;" : "=l" (n_clock)); // make sure the compiler will not reorder this
    return n_clock;
}

The volatile keyword tells the optimizer that you really mean it and don't want it moved / optimized away. This is a standard way of doing so both in PTX and e.g. in gcc.

Note that this returns clocks, not nanoseconds. You need to query for device clock frequency (using clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, 0))). Also note that on older devices there are two frequencies (or three if you count the memory frequency which is irrelevant in this case): the device clock and the shader clock. What you want is the shader clock.

With the 64-bit version of the register you don't need to worry about overflowing as it generally takes hundreds of years. On the other hand, the 32-bit version can overflow quite often (you can still recover the result - unless it overflows twice).

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!