OpenCL - How to I query for a device's SIMD width?

主宰稳场 提交于 2019-11-30 08:05:19

I found the answer I was looking for. It turns out that you don't query the device for this information, you query the kernel object (in OpenCL). My source is:

http://www.hpc.lsu.edu/training/tutorials/sc10/tutorials/SC10Tutorials/docs/M13/M13.pdf

(Page 108)

which says:

The most efficient work group sizes are likely to be multiples of the native hardware execution width

  • wavefront size in AMD speak/warp size in Nvidia speak
  • Query device for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

So, in short, the answer appears to be to call the clGetKernelWorkGroupInfo() method with a param name of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. See this link for more information on this method:

http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetKernelWorkGroupInfo.html

On AMD, you can query CL_DEVICE_WAVEFRONT_WIDTH_AMD. That's different from CL_DEVICE_SIMD_WIDTH_AMD, which returns the number of threads it executes in each clock cycle. The latter may be smaller than the wavefront size, in which case it takes multiple clock cycles to execute one instruction for all the threads in a wavefront.

On NVIDIA, you can query the warp size width using clGetDeviceInfo with CL_DEVICE_WARP_SIZE_NV (although this is always 32 for current GPUs), however, this is an extension, as OpenCL defines nothing like warps or wavefronts. I don't know about any AMD extension that would allow to query for the wavefront size.

For AMD: clGetDeviceInfo(..., CL_DEVICE_WAVEFRONT_WIDTH_AMD, ...) (if cl_amd_device_attribute_query extension supported)

For Nvidia: clGetDeviceInfo(..., CL_DEVICE_WARP_SIZE_NV, ...) (if cl_nv_device_attribute_query extension supported)

But there is no uniform way. The way suggested by Jonathan DeCarlo doesn't work, I was using it for GPUs if these two extensions does not supported - for example Intel iGPU, but recently I faced wrong results on Intel HD 4600:

Intel HD 4600 says CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE=32 while in fact Intel GPUs seems to have wavefront equal to 16, so I faced incorrect results, everything works fine if barriers were used for wavefront=16.

P.S. I have not enough reputation to comment Jonathan DeCarlo answer about this, will be glad if somebody will add comment.

You can use the clGetDeviceInfo to get maximum number of workitems you can have in your local workset for each dimension. This is most likely multiple of your wavefront size.

See: http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetDeviceInfo.html

For CUDA (using NVIDIA), please take a look at B.4.5 Cuda programming guide from NVIDIA. There is a variable for containing this information. You can query this variable at runtime. For AMD , I'm not sure if there is such a variable.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!