CUDA global memory load and store
Question

I am trying to hide global memory latency. Take the following code:

    for (int i = 0; i < N; i++) {
        x = global_memory[i];
        ... do some computation on x ...
        global_memory[i] = x;
    }

I want to know whether loads and stores from global memory are blocking, i.e., whether execution waits on that line until the load or store has finished before moving on to the next one. For example, take the following code:

    x_next = global_memory[0];
    for (int i = 0; i < N; i++) {
        x = x_next;
        x_next = global_memory[i+1];
        ... do some computation on x ...
        global_memory[i] =