What is the algorithm to determine the optimal work group size and number of work groups?

Submitted by 情到浓时终转凉″ on 2019-11-26 16:28:08

Question


The OpenCL standard defines the following parameters for querying information about a device and a compiled kernel:

  • CL_DEVICE_MAX_COMPUTE_UNITS

  • CL_DEVICE_MAX_WORK_GROUP_SIZE

  • CL_KERNEL_WORK_GROUP_SIZE

  • CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

Given these values, how can I calculate the optimal work group size and number of work groups?


Answer 1:


You discover these values experimentally for your algorithm. Use a profiler to get hard numbers.

I like to use CL_DEVICE_MAX_COMPUTE_UNITS as the number of work groups, because I often rely on synchronizing work items. I usually run kernels with little branching, so they take the same time to execute in each compute unit.
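As a minimal sketch of the "one work group per compute unit" choice: the helpers below are hypothetical (they are not part of the OpenCL API) and take the queried values as plain arguments, so `compute_units` stands for whatever `clGetDeviceInfo` reports for CL_DEVICE_MAX_COMPUTE_UNITS. With a fixed number of work groups, each work item usually has to loop over several elements, which the second helper estimates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: global work size when launching exactly one
 * work group per compute unit.
 *   compute_units: value reported for CL_DEVICE_MAX_COMPUTE_UNITS
 *   local_size:    the work group size you chose */
size_t global_size_one_group_per_cu(size_t compute_units, size_t local_size)
{
    assert(compute_units > 0 && local_size > 0);
    return compute_units * local_size;
}

/* Hypothetical helper: how many elements each work item must process
 * so that global_size work items cover n elements (ceiling division). */
size_t items_per_work_item(size_t n, size_t global_size)
{
    assert(global_size > 0);
    return (n + global_size - 1) / global_size;
}
```

For example, 8 compute units with a work group size of 64 give a global size of 512, and covering 1000 elements then needs 2 loop iterations per work item.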

Some multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will be optimal for your device. What that multiple actually is depends on your memory access pattern and the type of work you are doing in each work item. Use 1 as the multiple when you are running a heavy, compute-bound (ALU) kernel. Try a larger multiple to hide memory latency if you are bottlenecked by memory access. Use a profiler to determine when your access time and your ALU time are optimal.
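One way to turn this into a list of sizes to profile, sketched under the assumption that the work group size must also stay within CL_KERNEL_WORK_GROUP_SIZE (the per-kernel maximum reported by `clGetKernelWorkGroupInfo`). The function name and parameters are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill `out` with candidate work group sizes
 * m, 2m, 3m, ... where m is CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
 * stopping at kernel_max_wg (CL_KERNEL_WORK_GROUP_SIZE) or when `out`
 * is full. Returns the number of candidates written. */
size_t candidate_local_sizes(size_t preferred_multiple, size_t kernel_max_wg,
                             size_t *out, size_t out_cap)
{
    assert(preferred_multiple > 0);
    size_t count = 0;
    for (size_t size = preferred_multiple;
         size <= kernel_max_wg && count < out_cap;
         size += preferred_multiple) {
        out[count++] = size;
    }
    return count;
}
```

The first candidate (multiple of 1) is the starting point for compute-bound kernels; the later ones are the larger multiples to try when memory access is the bottleneck.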

The optimal ALU-to-fetch ratio is 1:1 for any device. This is rarely achieved in practice, so the priority is to keep the ALU/SIMD banks saturated, which means ALU:fetch should be greater than 1 whenever possible. A ratio below 1 means you should try a larger work group size to better hide the memory latency.




Answer 2:


As mfa said, you have to discover these experimentally. I wanted to add that, depending on what you are computing (particularly the size of the job handled by each work item), two configurations worth trying are:

  • Lots of work items with small work groups and each job item being small.
  • Fewer work items with larger work groups and each job item being larger.

That is, check these base cases and figure out how each one affects the processing pipeline.

In essence you have to tweak it. I often run the kernel several times with different parameters (profiling each run) and then generate a surface plot to see how it behaves.
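The parameter sweep described here could be structured roughly as follows. This is only a sketch: `sweep_launch_params`, `measure_fn` and `sweep_result` are names invented for illustration, and `toy_cost_ms` is a made-up cost model standing in for a real measurement (for instance, timing the kernel with a profiler or OpenCL event profiling). The per-combination times are also the data you would feed into a surface plot.

```c
#include <assert.h>
#include <stddef.h>

/* Callback that returns the measured run time in milliseconds for one
 * (work group size, number of work groups) combination. */
typedef double (*measure_fn)(size_t local_size, size_t num_groups, void *ctx);

typedef struct {
    size_t local_size;
    size_t num_groups;
    double ms;
} sweep_result;

/* Try every combination and return the fastest one. */
sweep_result sweep_launch_params(const size_t *local_sizes, size_t n_local,
                                 const size_t *group_counts, size_t n_groups,
                                 measure_fn measure, void *ctx)
{
    assert(n_local > 0 && n_groups > 0 && measure != NULL);
    sweep_result best = { local_sizes[0], group_counts[0], 0.0 };
    int have_best = 0;
    for (size_t i = 0; i < n_local; ++i) {
        for (size_t j = 0; j < n_groups; ++j) {
            double ms = measure(local_sizes[i], group_counts[j], ctx);
            if (!have_best || ms < best.ms) {
                best.local_size = local_sizes[i];
                best.num_groups = group_counts[j];
                best.ms = ms;
                have_best = 1;
            }
        }
    }
    return best;
}

/* Made-up cost model, NOT real data: fastest at 128 work items per
 * group and 16 groups. Replace with an actual timed kernel launch. */
double toy_cost_ms(size_t local_size, size_t num_groups, void *ctx)
{
    (void)ctx;
    double d_local = (double)local_size - 128.0;
    double d_groups = (double)num_groups - 16.0;
    return 1.0 + d_local * d_local / 1000.0 + d_groups * d_groups / 10.0;
}
```

In real use, a single run per combination is noisy, so the measurement callback would typically repeat the launch a few times and return an average or minimum.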



Source: https://stackoverflow.com/questions/10096443/what-is-the-algorithm-to-determine-optimal-work-group-size-and-number-of-workgro
