CUDA global memory load and store

Question

I am trying to hide global memory latency. Take the following code:

for (int i = 0; i < N; i++) {
    x = global_memory[i];

    // ... do some computation on x ...

    global_memory[i] = x;
}

I wanted to know whether loads and stores to global memory are blocking, i.e., whether the next line does not run until the load or store has finished. For example, take the following code:

x_next = global_memory[0];
for (int i = 0; i < N; i++) {
    x = x_next;
    // Guard the prefetch so the last iteration does not read global_memory[N].
    if (i + 1 < N)
        x_next = global_memory[i + 1];

    // ... do some computation on x ...

    global_memory[i] = x;
}

In this code, x_next is not used until the next iteration, so does loading x_next overlap with the computation? In other words, which of the following figures shows what happens?

Answer 1:

"I wanted to know whether load and store from global memory is blocking, i.e, it doesn't run next line until load or store is finished."

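It is not blocking. A load operation does not by itself stall a thread; the thread stalls only when it reaches an instruction that needs the loaded value and that value has not yet arrived.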

Note that the compiler will often seek to unroll loops (and reorder activity) to enable what you are proposing to do "manually".
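As a rough illustration of that point, the per-thread loop can carry an unroll hint, which leaves the compiler free to issue several independent loads before the arithmetic that consumes them. This is only a sketch under assumptions: the kernel name `scale_unrolled`, the host helper `run_unrolled_demo`, the launch configuration, and the stand-in computation `x * 2.0f + 1.0f` are all hypothetical and not from the question. Whether the loads are actually hoisted depends on the compiler version and target architecture.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread walks a grid-stride loop over the array.
// The unroll hint allows the compiler to batch several independent loads
// ahead of the arithmetic that uses them.
__global__ void scale_unrolled(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;
    #pragma unroll 4
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float x = data[i];     // load is issued; the thread does not wait here
        x = x * 2.0f + 1.0f;   // first use of x; a stall can only happen here
        data[i] = x;           // store is issued without waiting for completion
    }
}

// Hypothetical host helper: runs the kernel on n elements and returns the
// number of wrong results, or -1 if a CUDA call fails.
int run_unrolled_demo(int n)
{
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i % 100);

    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale_unrolled<<<64, 256>>>(d, n);
    cudaError_t launch_err = cudaGetLastError();

    // This copy also waits for the kernel to finish.
    cudaError_t copy_err =
        cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != static_cast<float>(i % 100) * 2.0f + 1.0f) ++bad;
    return bad;
}
```

Calling `run_unrolled_demo` from a small `main` and compiling with `nvcc` is enough to try it. To see whether the loads were really grouped, the generated machine code can be inspected with `cuobjdump -sass`.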

In any event, your second version should allow the load of global_memory[1] to be issued and to proceed while the computation on global_memory[0] is under way.
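The manual prefetch from the question could look roughly like this inside a real kernel. It is a minimal sketch, assuming each thread processes its own grid-stride sequence of elements; the names `scale_prefetch` and `run_prefetch_demo`, the launch configuration, and the computation `x * 2.0f + 1.0f` are placeholders introduced here for illustration. The prefetch is guarded so that no thread reads past the end of the array.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the load for the next element is issued before the
// computation on the current one, so the two can overlap.
__global__ void scale_prefetch(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x_next = data[i];                 // first load for this thread
    for (; i < n; i += stride) {
        float x = x_next;
        int j = i + stride;
        if (j < n) x_next = data[j];        // prefetch; not used until next iteration

        x = x * 2.0f + 1.0f;                // computation on the current element

        data[i] = x;                        // store is issued without waiting
    }
}

// Hypothetical host helper: runs the kernel on n elements and returns the
// number of wrong results, or -1 if a CUDA call fails.
int run_prefetch_demo(int n)
{
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i % 100);

    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale_prefetch<<<64, 256>>>(d, n);
    cudaError_t launch_err = cudaGetLastError();

    // This copy also waits for the kernel to finish.
    cudaError_t copy_err =
        cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != static_cast<float>(i % 100) * 2.0f + 1.0f) ++bad;
    return bad;
}
```

Each element is read and written by exactly one thread, so the prefetch of `data[j]` never races with another thread's store. The compiler may already produce a similar schedule from the simpler loop, so it is worth profiling both forms before keeping the manual version.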

Global memory stores are also "fire and forget" -- nonblocking.

Source: https://stackoverflow.com/questions/58310650/cuda-global-memory-load-and-store

Tags

cuda

gpu
