Variable in OpenCL kernel 'for-loop' reduces performance

Submitted by 我是研究僧i on 2019-12-05 23:08:03

Yes, the most likely cause of the performance degradation is that the compiler can't unroll the loop when its trip count is only known at runtime. There are a few things you could try to improve the situation.


You could define the parameter as a preprocessor macro passed via your program build options. This is a common trick used to build values that are only known at runtime into kernels as compile-time constants. For example:

clBuildProgram(program, 1, &device, "-Dnum_loops=50000", NULL, NULL);

You could construct the build options dynamically using sprintf to make this more flexible. Clearly this will only be worth it if you don't need to change the parameter often, so that the overhead of recompilation doesn't become a problem.
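As a rough sketch of building the options string at runtime, the helper below formats the -D flag into a caller-supplied buffer. The name make_build_options is hypothetical, and snprintf is used in place of sprintf so that an undersized buffer is reported rather than overrun.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes "-Dnum_loops=<value>" into buf.
   Returns 0 on success, -1 if formatting fails or the buffer is too small. */
int make_build_options(char *buf, size_t buf_size, int num_loops)
{
    assert(buf != NULL && buf_size > 0);

    int written = snprintf(buf, buf_size, "-Dnum_loops=%d", num_loops);
    if (written < 0 || (size_t)written >= buf_size) {
        return -1;  /* encoding error or truncated output */
    }
    return 0;
}

/* Intended use on the host, assuming program and device already exist:
 *
 *     char options[64];
 *     if (make_build_options(options, sizeof options, 50000) == 0)
 *         clBuildProgram(program, 1, &device, options, NULL, NULL);
 */
```

Each distinct value of num_loops needs its own build, so caching the built program per value is probably worthwhile if a handful of values recur.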


You could investigate whether your OpenCL platform uses any pragmas that can give the compiler hints about loop-unrolling. For example, some OpenCL compilers recognise #pragma unroll (or similar). OpenCL 2.0 has an attribute for this: __attribute__((opencl_unroll_hint)).
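One possible way to apply the OpenCL 2.0 attribute without breaking older compilers is to hide it behind a macro, as sketched below. UNROLL_HINT and sum_first are hypothetical names used only for illustration. The guard relies on the predefined __OPENCL_C_VERSION__ macro, so the same source also compiles as plain C, where the hint expands to nothing. The compiler remains free to ignore the hint.

```c
/* Expands to the OpenCL C 2.0 unroll attribute when available,
   and to nothing otherwise (older OpenCL C, or a plain C build). */
#if defined(__OPENCL_C_VERSION__) && (__OPENCL_C_VERSION__ >= 200)
#define UNROLL_HINT(n) __attribute__((opencl_unroll_hint(n)))
#else
#define UNROLL_HINT(n)
#endif

/* Hypothetical loop: sums the integers 0 .. num_loops-1.
   The hint asks the compiler to unroll the loop by a factor of 4. */
int sum_first(int num_loops)
{
    int acc = 0;

    UNROLL_HINT(4)
    for (int kk = 0; kk < num_loops; kk++) {
        acc += kk;
    }
    return acc;
}
```

On platforms that recognise #pragma unroll instead, the macro body would need to be replaced with whatever that vendor documents.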


You could manually unroll the loop. How this would look depends on what assumptions you can make about the num_loops parameter. For example, if you know (or can ensure) that it will always be a multiple of 4, you could do something like this:

for (int kk = 0; kk < num_loops;)
{
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
}

Even if you can't make such assumptions, you should still be able to perform manual unrolling, but it may require some extra work (for example, to finish any remaining iterations).
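A minimal sketch of that extra work, assuming only that num_loops is non-negative: the main loop handles four iterations per trip, and a short second loop finishes the zero to three iterations that are left over. The body function is a hypothetical stand-in for the real loop body.

```c
#include <assert.h>

/* Hypothetical stand-in for the real loop body: adds kk to *acc. */
static void body(int kk, long *acc)
{
    *acc += kk;
}

/* Runs body() for kk = 0 .. num_loops-1, unrolled by a factor of 4. */
long run_unrolled(int num_loops)
{
    assert(num_loops >= 0);

    long acc = 0;
    int kk = 0;
    /* Largest multiple of 4 that does not exceed num_loops. */
    int unrolled_end = num_loops - (num_loops % 4);

    for (; kk < unrolled_end; kk += 4) {
        body(kk, &acc);
        body(kk + 1, &acc);
        body(kk + 2, &acc);
        body(kk + 3, &acc);
    }

    /* Remainder: at most three iterations. */
    for (; kk < num_loops; kk++) {
        body(kk, &acc);
    }
    return acc;
}
```

The same structure works inside a kernel. The loop condition is then evaluated roughly a quarter as often, at the cost of a larger code size.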

The for loop evaluates the second expression in the (;;) header before every iteration to decide whether to continue the loop. Such conditional branches can cause control flow to fork and unneeded computations to be discarded, which is wasteful.

The correct way to do it is to add another dimension to your kernel, and to keep that dimension entirely within one work-group so that it will be executed sequentially inside one compute unit.
