unroll loops in an AMD OpenCL kernel

Submitted by 断了今生、忘了曾经 on 2019-12-24 14:18:31

Question


I'm trying to assess the performance of OpenCL on AMD hardware. I have a kernel for the Hough transform that contains two #pragma unroll statements, but running the kernel with them does not produce any speedup.

kernel void hough_circle(read_only image2d_t imageIn, global int* in, const int w_hough, __global int* circle)
{
    sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int gid0 = get_global_id(0);
    int gid1 = get_global_id(1);

    uint4 pixel;
    int x0 = 0, y0 = 0, r;
    int maxval = 0;
    pixel = read_imageui(imageIn, sampler, (int2)(gid0, gid1));
    if (pixel.x == 255)
    {
        #pragma unroll 20
        for (int r = 90; r < 110; r += 1)
        {
            // int r = 190;

            #pragma unroll 360
            for (int theta = 0; theta < 360; theta++)
            {
                x0 = (int) round(gid0 - r * cos((float) radians((float) theta)));
                y0 = (int) round(gid1 - r * sin((float) radians((float) theta)));

                // if ((x0 > 0) && (x0 < get_global_size(0)) && (y0 > 0) && (y0 < get_global_size(1)))
                //     in[w_hough * y0 + x0]++;
            }
        }
    }
}

Does #pragma unroll work with AMD OpenCL? Any help would be appreciated.


Answer 1:


Unrolling does work with AMD.

http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/

That tool includes KernelAnalyzer, which lets you inspect the code their compiler actually generates. I used it to verify that unrolling really does produce a different kernel.

However, unrolling loops does not necessarily give you any speedup. After all, it only saves the loop's jump instructions at the expense of program size, whereas on a GPU you are usually bound by memory latency.
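To make that trade-off concrete, here is a minimal sketch in plain C (the functions read the same in OpenCL C). The names sum_rolled and sum_unrolled4 are made up for illustration, and the hand-unrolled version only approximates what a compiler might emit for "#pragma unroll 4"; the actual output depends on the compiler.

```c
/* Rolled loop: one compare-and-branch per element. */
int sum_rolled(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += v[i];
    return s;
}

/* The same loop unrolled by 4 by hand: one compare-and-branch per four
   elements, but four copies of the body, so the code gets larger. */
int sum_unrolled4(const int *v, int n)
{
    int s = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s += v[i];
        s += v[i + 1];
        s += v[i + 2];
        s += v[i + 3];
    }
    for (; i < n; ++i)   /* remainder when n is not a multiple of 4 */
        s += v[i];
    return s;
}
```

Both functions return the same sum for the same input; the only thing unrolling removes is loop-control work, so when the body itself is expensive (or waits on memory) the difference tends to be hard to measure.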

In your case the bottleneck is probably the sin/cos functions; those are extremely slow on AMD hardware (and on other GPUs as well). You should use native_sin and native_cos. They are less precise and do not support as wide an input range as the normal ones, which is why they are not used by default, but in most cases they are good enough. The precision of the native_ functions is incidentally the same as that required of sin and cos in DirectX shaders.



Source: https://stackoverflow.com/questions/19785731/unroll-loops-in-an-amd-opencl-kernel
