OpenCL performance measurement

Question

What is the most appropriate way to present the performance of an OpenCL application (especially its compute kernels)? I have implemented some algorithms and was thinking about presenting speed-up and efficiency charts, but by definition those require knowing how many processors were used in the calculation. With OpenCL, that cannot be determined.

Answer 1:

Create your command queue with the CL_QUEUE_PROFILING_ENABLE flag set, then use clGetEventProfilingInfo to extract timing data. See Chapter 9 of "OpenCL Programming Guide" for more details.
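As a minimal sketch of what to do with the values clGetEventProfilingInfo returns: the runtime reports four timestamps per event (CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END), each a cl_ulong in nanoseconds. The struct and function names below are hypothetical, and uint64_t stands in for cl_ulong so that the snippet compiles without the OpenCL headers; the comment shows where the real call would go.

```c
#include <stdint.h>

/* Hypothetical container for the four timestamps of one event.
 * In real host code each field is a cl_ulong filled by, for example:
 *   clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
 *                           sizeof(cl_ulong), &stamps.start, NULL);
 * once clWaitForEvents(1, &evt) has returned. All values are device
 * time counter readings in nanoseconds. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} profile_stamps;

/* Time the command spent executing on the device, in milliseconds. */
double kernel_exec_ms(const profile_stamps *p)
{
    return (double)(p->end - p->start) / 1e6;
}

/* Delay between enqueueing the command and the device starting it, in ms. */
double launch_latency_ms(const profile_stamps *p)
{
    return (double)(p->start - p->queued) / 1e6;
}
```

Reporting kernel_exec_ms separately from launch_latency_ms keeps queueing delay out of the kernel time, which a host-side timer cannot do.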

Answer 2:

I'd say that vocaro's suggestion is the most CL-appropriate, since it leverages features of the language/runtime to do what you want. However, if for some reason that doesn't work for you on your platform, there is another solution if you are only interested in wall-clock execution time of a given CL operation.

You can wrap the operation in clFinish() calls and use your system's highest-resolution timer to get the elapsed time. Something like this, using Mac OS X as an example:

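#include <mach/mach_time.h>  /* mach_absolute_time() */

uint64_t start, end;

clFinish(command_queue);  /* drain earlier work so it is not timed */
start = mach_absolute_time();
clEnqueueNDRangeKernel(command_queue, /* etc. */ );
clFinish(command_queue);  /* block until the kernel has completed */
end = mach_absolute_time();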

You can use the information in Apple QA1398 to convert these absolute times to nanoseconds (it scales the tick difference by the numer/denom ratio obtained from mach_timebase_info). Note that this method isn't as accurate as event profiling, since the measurement includes the overhead of clEnqueueNDRangeKernel and clFinish.

The call to clFinish() guarantees that all previously queued CL commands have been submitted to the compute device and have completed.
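On platforms without mach_absolute_time, the same wall-clock pattern can be sketched with POSIX clock_gettime. This is only an illustration under stated assumptions: now_ns, time_call_ns and dummy_work are hypothetical names, and the callback stands in for "enqueue the kernel, then call clFinish()".

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime under strict C modes */
#endif
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds (POSIX CLOCK_MONOTONIC). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Returns the elapsed wall-clock time of one blocking call to work(arg).
 * For OpenCL, call clFinish() on the queue before this, and have `work`
 * enqueue the kernel and then call clFinish() again. */
uint64_t time_call_ns(void (*work)(void *), void *arg)
{
    uint64_t start = now_ns();
    work(arg);
    return now_ns() - start;
}

/* Placeholder workload so the sketch runs without an OpenCL device. */
void dummy_work(void *arg)
{
    volatile unsigned long *counter = arg;
    for (unsigned long i = 0; i < 1000000UL; ++i)
        (*counter)++;
}
```

A monotonic clock is used here because it does not jump when the system time is adjusted during a measurement.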

Answer 3:

NVIDIA's Best Practices Guide has a whole chapter devoted to performance measurement. In short, it boils down to this: you can either use an external timer (as proposed by @James) or use the GPU's profiling mechanisms (as proposed by @vocaro). The latter should offer better precision, though I personally stick to a CPU timer for the sake of simplicity.

"according to the definition I need to know how many processors I have used in calculations"

This is true for multi-CPU parallelization, where the number of processors used is directly controlled by the user. That is not the case with a GPU: you can use the GPU, but you cannot control scheduling inside the device. So usually (actually, on all CPU-vs-GPU charts I've ever seen) the chart shows either "SpeedUp(problem dimension)" (for "marketing" presentations), "SpeedUp(kernel options)" (for more "techie" presentations; kernel options can be grid parameters or particular code variants), or "SpeedUp(number of GPUs used)" (when your program supports multi-GPU, of course).
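To make the "SpeedUp(problem dimension)" idea concrete, here is a minimal sketch that turns two series of measured times into speed-up values, one per problem size. The function names are hypothetical, and the numbers in the note after the code are made up for illustration, not measurements.

```c
#include <stddef.h>

/* Speed-up of the OpenCL version over a reference implementation:
 * S = t_ref / t_ocl, with both times in the same unit. */
double speedup(double t_ref, double t_ocl)
{
    return t_ref / t_ocl;
}

/* Fills out[i] with the speed-up measured for the i-th problem size. */
void speedup_series(const double *t_ref, const double *t_ocl,
                    double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = speedup(t_ref[i], t_ocl[i]);
}
```

For example, reference times of 8.0 and 64.0 against OpenCL times of 4.0 and 8.0 give speed-ups of 2.0 and 8.0, which would then be plotted against the two problem sizes.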

Answer 4:

I believe some GPUs lack the hardware needed to measure time precisely, which means you might have to fall back to a CPU timer. But I may be wrong.

Source: https://stackoverflow.com/questions/7980090/opencl-performance-measurement

Tags

performance

opencl