CUDA profiler reports inefficient global memory access

时间秒杀一切 提交于 2019-11-29 18:07:58
Robert Crovella

I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses.

To take one example, your float4 vel array is stored in memory like this:

0.x 0.y 0.z 0.w 1.x 1.y 1.z 1.w 2.x 2.y 2.z 2.w 3.x 3.y 3.z 3.w ...
  ^               ^               ^               ^             ...
  thread0         thread1         thread2         thread3

So when you do this:

vel[index + offset].x += ...;   // line 247

you are accessing (storing) at the locations (.x) that I have marked above. The gaps in between each ^ mark indicate an inefficient access pattern, which the profiler is pointing out. (It does not matter that in the very next line of code, you are storing to the .y locations.)

There are at least 2 solutions, one of which would be a classical AoS -> SoA reorganization of your data, with appropriate code adjustments. This is well documented (e.g. here on the cuda tag and elsewhere) in terms of what it means, and how to do it, so I will let you look that up.

The other typical solution is to load a float4 quantity per thread, when you need it, and store a float4 quantity per thread, when you need to. Your code can be trivially reworked to do this, which should give improved profiling results:

//preceding code need not change
while(index + offset < numParticles)
{
    float4 my_vel = vel[index + offset];
    float4 my_acc = acc[index + offset];
    my_vel.x += dt*my_acc.x;   
    my_vel.y += dt*my_acc.y;
    my_vel.z += dt*my_acc.z;
    vel[index + offset] = my_vel;

    float4 my_pos = pos[index + offset];
    my_pos.x += dt*my_vel.x; 
    my_pos.y += dt*my_vel.y;
    my_pos.z += dt*my_vel.z;
    pos[index + offset] = my_pos;

    offset += blockDim.x * gridDim.x;
}

Even though you might think that this code is "less efficient" than your code, because your code "appears" to be only loading and storing .x, .y, .z, whereas mine "appears" to also load and store .w, in fact there is essentially no difference, due to the way a GPU loads and stores to/from global memory. Although your code does not appear to touch .w, in the process of accessing the adjacent elements, the GPU will load the .w elements from global memory, and also (eventually) store the .w elements back to global memory.

What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247

For line 247 in your original code, you are accessing one float quantity per thread for the load operation of acc.x, and one float quantity per thread for the load operation of vel.x. A float quantity per thread by itself should require 128 bytes for a warp, which is 4 32-byte L2 cachelines. Two loads together would require 8 L2 cacheline loads. This is the ideal case, which assumes that the quantities are packed together nicely (SoA). But that is not what you have (you have AoS).

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!