Optimal Local/Global worksizes in OpenCL

Question

I am wondering how to choose optimal local and global work sizes for different devices in OpenCL. Is there any universal rule for AMD, NVIDIA, and Intel GPUs? Should I analyze the physical layout of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?

Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL) find good values by simply testing many combinations of local/global work sizes and choosing the best one.

Answer 1:

NVIDIA recommends that your (local) work-group size is a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled together as a unit). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I'm unsure about Intel, but you can find this type of information in their documentation.

So say you are doing some computation and you have 2300 work-items (the global size). 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL will choose one for you, and since the local size has to divide the global size evenly, it cannot be a multiple of 32 or 64 here, so it will be a bad one. When the local size is not a multiple of the atomic unit of execution, part of a warp/wavefront is left with idle threads, which leads to bad device utilization. Thus, it can be beneficial to add some "dummy" threads so that you get a global size which is a multiple of 32/64, and then use a local size of 32/64. For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by both 32 and 64. In the actual kernel, you can write something like:

// realNumberOfThreads is the unpadded work-item count (2300 in the example)
int globalID = get_global_id(0);
if (globalID >= realNumberOfThreads)
    globalID = 0;

This will make the four extra threads do the same work as thread 0 (it is often faster to do some extra work than to have many idle threads).
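On the host side, the padding described above comes down to rounding the real work-item count up to the next multiple of the local size. Here is a minimal sketch in plain C; `round_up_global_size` is a hypothetical helper name, not part of the OpenCL API, and the result is what you would pass as the global work size to clEnqueueNDRangeKernel along with the chosen local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL function): round the real number of
 * work-items up to the next multiple of local_size, so that the global
 * size is evenly divisible by the local size. */
static size_t round_up_global_size(size_t real_count, size_t local_size)
{
    assert(local_size > 0);

    size_t remainder = real_count % local_size;
    if (remainder == 0)
        return real_count;                      /* already a multiple */

    return real_count + (local_size - remainder);  /* add dummy work-items */
}
```

With the numbers from the answer, round_up_global_size(2300, 32) and round_up_global_size(2300, 64) both give 2304, i.e. 4 dummy work-items.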

Hope that answered your question. GL HF!

Answer 2:

If your processing uses little memory (e.g. only to store kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.

See my answer here : https://stackoverflow.com/a/13762847/145757

If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo. That local size will in turn determine your global size.
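As a rough illustration of that last step, the sketch below picks a local size from two values you would query with clGetKernelWorkGroupInfo: the kernel's maximum work-group size (CL_KERNEL_WORK_GROUP_SIZE) and, on OpenCL 1.1 or later, the preferred multiple (CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, typically the warp/wavefront size). The function `pick_local_size` is a hypothetical helper, and "largest allowed multiple" is only an assumed starting point, not necessarily the fastest choice for a given kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL function).
 * kernel_max:         value reported for CL_KERNEL_WORK_GROUP_SIZE
 * preferred_multiple: value reported for
 *                     CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
 * Returns the largest multiple of preferred_multiple that does not
 * exceed kernel_max, or kernel_max itself if even one multiple is
 * too large. */
static size_t pick_local_size(size_t kernel_max, size_t preferred_multiple)
{
    assert(kernel_max > 0);
    assert(preferred_multiple > 0);

    if (kernel_max < preferred_multiple)
        return kernel_max;   /* cannot fit a full warp/wavefront */

    return (kernel_max / preferred_multiple) * preferred_multiple;
}
```

For example, a kernel limited to 200 work-items per group on a device with a preferred multiple of 64 would get a local size of 192. The global size is then rounded up to a multiple of that value.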

Source: https://stackoverflow.com/questions/14254194/optimal-local-global-worksizes-in-opencl

Tags

opencl